Smart grid public datasets: Characteristics and associated applications

The development of smart grids and traditional power grids, together with the integration of Internet of Things devices, has resulted in a wealth of data crucial to advancing energy management and efficiency. Nevertheless, public datasets remain limited due to grid operators' and companies' reluctance to disclose proprietary information. The authors present a comprehensive analysis of more than 50 publicly available datasets, organised into three main categories: micro- and macro-consumption data, detailed in-home consumption data (often referred to as non-intrusive load monitoring datasets or building data) and grid data. Furthermore, the study underscores future research priorities, such as advancing synthetic data generation, improving data quality and standardisation, and enhancing big data management in smart grids. The aim of the authors is to give researchers in the smart and power grid domain a comprehensive reference point for picking suitable and relevant public datasets to evaluate their proposed methods. The provided analysis highlights the importance of following a systematic and standardised approach in evaluating future methods and directs readers to potential future avenues of research in the area of smart grid analytics.


| Contributions
The rapid transition of the power sector towards more sustainable and efficient smart grid systems, enhanced by the integration of IoT technologies, has resulted in a complex and data-rich environment. This work is motivated by the pressing need to guide researchers through this intricate landscape. By offering a comprehensive review and comparative analysis of smart grid datasets, we seek to simplify dataset selection for specific applications. This paper makes several contributions to the field of electrical grid research and data analysis, which are concisely outlined as follows:
• Offers an extensive and systematic review of data sources in the electrical grid domain, encompassing smart metre datasets, NILM datasets, and grid datasets. This review emphasises publicly available data and facilitates the identification of relevant datasets for specific research questions or analyses, addressing the challenge of selecting the most suitable data sources in a rapidly evolving field.
• Analyses the features and characteristics of SG datasets, elucidating their applications and relevance in various research contexts, including IoT-based energy management solutions. A comparative analysis of the features, strengths, and weaknesses of various datasets is presented, enabling researchers to make informed decisions when selecting appropriate data sources for their studies.
• Examines the preprocessing methodologies, feature engineering techniques, and evaluation procedures employed by researchers, fostering a deeper understanding of best practices in the field. This also aims to mitigate potential pitfalls in the utilisation and handling of diverse datasets, promoting a more robust and rigorous approach to research in IoT-driven SG systems.

| Previous work
In this section, we examine notable literature reviews and surveys in the domain, providing a concise overview of the existing knowledge in this field.
The work in ref. [11] investigates the SG architecture for the study of software reliability engineering. The article cites and discusses the characteristics of 15 datasets, which can be used for reliability engineering, and divides them into three main categories: loss of loading probability, power distribution, and hardware. However, the article does not offer a detailed analysis of these datasets and their characteristics.
The comprehensive study by ref. [12] presents 13 consumer datasets and their characteristics while exploring deep learning techniques applied to load analysis, forecasting and management systems. The challenges associated with implementing deep learning techniques are discussed as well as potential solutions to enhance performance. Furthermore, the authors identify five open research issues concerning the future of SGs. In a related review paper, ref. [13] focuses on data analytics applications of smart metre data, featuring 10 datasets with general characteristics (e.g. number of records, frequency and duration) and corresponding references. Although both reviews contain useful information, neither delves into extensive detail about these public datasets, which would be beneficial for researchers seeking suitable datasets for their studies.
Iqbal et al. [14] provide a comprehensive review of 42 NILM datasets, detailing their characteristics and statistical information. However, the authors do not discuss NILM applications or reference research articles that utilised these datasets. In contrast, the study in ref. [15] reviews several NILM datasets and their characteristics, while also mentioning the types of NILM approaches they permit, such as event-based or event-less methods. Despite these insights, the review does not elaborate on how the datasets were used or the specific techniques that were applied.
In the work of ref. [16], the models, tools, and datasets that can be used to operationalise local energy communities in practice were reviewed. The reviewed use cases are of interest to stakeholders but do not specify particular applications of the data. The mentioned datasets consist of demand-side data and climate-related data, with specified characteristics. However, the specific uses of these datasets were not referenced.
The review paper [17] discusses publicly available distribution and transmission grid datasets, detailing their characteristics and intended usage. However, the work does not provide examples of research efforts that demonstrate the practical application of these datasets. In contrast, the authors in ref. [18] focus on publicly available test distribution networks with features in the United States, characterising them and identifying their use cases. Although it provides valuable information, its scope is limited to public grid datasets with US features, leaving a broader perspective unexplored.
A comparison of the review articles and the contribution of our article is provided in Table 3.

| Methodology
The methodology followed is to construct three comprehensive search strings, one for each data type. We used six major search libraries, namely IEEE Xplore, ScienceDirect, Wiley Online Library, SpringerLink, MDPI, and ACM Digital Library. The search string used for macro and micro-level consumption data is (("smart metre" OR "energy consumption" OR "system level" OR "substation") AND ("smart grid" OR "power grid") AND ("public dataset" OR "publicly available") AND ("dataset")) and it returned a total of 275 articles. For the second type, detailed in-home consumption data, we used the search string (("Buildings" OR "Non-intrusive load monitoring" OR "NILM") AND ("public dataset" OR "publicly available")), which returned 250 articles.
Finally, for grid datasets we used the search string (("grid dataset" OR "test system" OR "benchmark grid" OR "Representative grid" OR "Generic grid") AND ("smart grid" OR "power grid")), which returned 501 articles. In addition to the articles that publish public datasets, the search strings return research articles that utilised public datasets to evaluate their proposed approaches. Since datasets remain relevant regardless of when they were released, we opted to keep all articles regardless of their year of publication. Lastly, irrelevant articles were excluded based on the title and abstract of the article. After this screening, 192, 149 and 320 articles remained for macro and micro-level datasets, detailed in-home consumption datasets, and grid datasets, respectively. The datasets used for evaluation by the articles that remained in the final set were then extracted. The dataset characteristics and associated applications were then extracted from the datasets' metadata and from the articles that utilised the datasets for various applications.
The rest of this paper is organised as follows. Sections 2-4 discuss micro and macro consumption data (which includes both smart metre and system-level data), detailed in-home consumption data and grid data, respectively. Each section discusses the applications and public datasets, and reviews the literature on the most popular datasets in each category. Sections 5 and 6 highlight data issues and conclude the paper. A graphical representation of the structure of the paper is shown in Figure 1.

| MACRO AND MICRO-LEVEL CONSUMPTION DATA
Smart metre data is the most commonly utilised data in SG analytics, with a wide range of applications. Smart metres typically record data at intervals of 10-60 min. Smart metre data can significantly improve grid efficiency and long-term viability by providing valuable information on energy consumption, electrical measurements (e.g. voltage, current, power factors) [19,20], metre-related issues, and outage data. Metre data management systems maintain records about each metre, such as its status, manufacturer, installation date, malicious behaviour, and reconfiguration data, as well as circuit installation locations and service point details. Service points represent the interface between the utility supply and a site's wiring system. Furthermore, customer account information, including contracted power, type, status, irregularities history, and billing data, can be leveraged for load forecasting by clustering similar customers [21]. A comprehensive summary of this information can be found in Table 1. Pablo et al. concluded in their work on 311K customers in Uruguay that complementary customer information and geolocalisation complement the consumption signal and are relevant features [22].
While smart metre data provides granular insights and allows for nuanced interventions and measures, system-level or macro data carries its own significance. Macro data provides a holistic view of consumption patterns in larger sections of the grid and is critical for high-level planning, management, and forecasting [23]. Conversely, micro or smart metre data offers detailed load profiles of individual households, presenting opportunities for customised energy efficiency strategies and DR programs. However, despite differences in scale and granularity, many applications, such as load forecasting, anomaly detection, and load management, overlap between macro and micro data. For example, while load forecasting at the micro level informs individual household energy management strategies, at the macro level it aids in power generation planning and grid stability measures. The methodologies developed for these applications can often be applied interchangeably between the two scales, although with adjustments to account for the inherent differences.
Therefore, given the considerable overlap in applications and in order to maintain coherence and efficiency in our presentation, we have elected to group both the smart metre (micro-level) and system-level (macro) datasets under the umbrella of "Macro and Micro-Level Consumption Data".This arrangement streamlines the discussion, eliminates redundancy, and underscores the interconnected nature of data analysis at different scales within the context of the smart grid.
This section discusses the most popular applications and public datasets and their characteristics.

| Bad data detection
Bad data detection, or anomaly detection, is a crucial preprocessing technique that improves data quality and the accuracy of models and analytics by handling missing values or correcting/removing outlier data. Smart metre data are time series data, so existing techniques for time series data can be applied. However, traditional short-term load forecasting (STLF) methods for imputing bad data have limitations [40]. Probabilistic approaches also face challenges in determining optimal rejection thresholds, especially for large datasets [41][42][43].
There is a lack of publicly available datasets with labelled ground-truth fine-grained anomalies for the SG context, except for EnerNOC [44], which has a limited number of anomalies. Anomaly detection work is typically divided into two steps: defining or injecting synthetic anomalies and implementing an anomaly detection technique. Imputation is important when dealing with missing values, considering the rate of missing values and the cause of failure.
In ref. [45], the authors used Prophet by Facebook to define anomalies and evaluated classification models on the Ausgrid residential dataset. The best performance was achieved using the Random Forest classifier. In refs. [46,47], the authors used modified generative adversarial networks (GANs) and removing variational autoencoder-based techniques to impute missing values and detect anomalies, respectively, on the GEFCom 2014 dataset [48].
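The two-step workflow described above (inject synthetic anomalies, then run a detector) can be illustrated with a minimal sketch. It does not reproduce the methods of refs. [45-47]; it assumes additive spike anomalies and swaps in a generic modified z-score detector based on the median absolute deviation. The function names, the synthetic sinusoidal load, and the threshold of 3.5 are all illustrative assumptions.

```python
import math
import random
import statistics


def inject_spikes(series, n_anomalies, scale, seed=0):
    """Return a copy of `series` with additive spikes and the sorted spike indices."""
    rng = random.Random(seed)
    idx = sorted(rng.sample(range(len(series)), n_anomalies))
    corrupted = list(series)
    for i in idx:
        corrupted[i] += scale  # assumption: anomalies are simple additive spikes
    return corrupted, idx


def detect_by_mad(series, threshold=3.5):
    """Flag indices whose modified z-score (median/MAD based) exceeds `threshold`."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []  # no spread in the data, so the score is undefined
    return [i for i, x in enumerate(series)
            if 0.6745 * abs(x - med) / mad > threshold]


# Hypothetical half-hourly load in kWh: one sinusoidal day repeated for a week.
daily_shape = [0.5 + 0.3 * math.sin(2 * math.pi * t / 48) for t in range(48)]
clean = daily_shape * 7

corrupted, injected = inject_spikes(clean, n_anomalies=5, scale=5.0)
detected = detect_by_mad(corrupted)
print("injected:", injected)
print("detected:", detected)
```

Because the injected indices are known, they act as the labelled ground truth that public datasets mostly lack, so precision and recall of any detector can be computed against them.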

| Non-technical loss detection
Non-technical loss detection focuses on identifying discrepancies between energy injected and electricity paid for. It is closely related to anomaly detection but evaluates users' load profiles in the same neighbourhood or previous benevolent profiles to detect anomalies [30]. The only known publicly available labelled dataset for this purpose is the State Grid Corporation of China (SGCC) dataset [49]. Another approach focuses on detecting abnormal behaviours in a private manner, such as the work in ref. [50].
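As a rough illustration of comparing a consumer against neighbouring load profiles, the sketch below measures how much a consumer's consumption falls relative to the neighbourhood median between two periods. This is a hypothetical heuristic for exposition only, not the method of ref. [30]; the function name `relative_drop`, the split point, and the toy data are assumptions.

```python
import statistics


def relative_drop(consumer, neighbours, split):
    """Fractional drop in a consumer's consumption relative to the neighbourhood
    median, comparing readings before index `split` with readings from it onwards."""
    def share(lo, hi):
        own = sum(consumer[lo:hi])
        reference = sum(statistics.median(n[t] for n in neighbours)
                        for t in range(lo, hi))
        return own / reference

    before = share(0, split)
    after = share(split, len(consumer))
    return 1.0 - after / before


# Toy daily consumption (kWh) over 10 days for three neighbours.
neighbours = [[10.0] * 10, [11.0] * 10, [9.0] * 10]
honest = [10.0] * 10
suspect = [10.0] * 5 + [4.0] * 5  # consumption collapses while neighbours stay flat

drop_honest = relative_drop(honest, neighbours, split=5)
drop_suspect = relative_drop(suspect, neighbours, split=5)
print(drop_honest, drop_suspect)
```

Normalising by the neighbourhood median means that a drop shared by the whole neighbourhood (for example, a holiday period) does not raise the score, whereas an isolated drop does.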

| Load profiling
Load profiling aims to understand users' or groups' typical patterns of electricity use, which is valuable for DR programs and prospective load forecasting [32]. Load profiling helps to better comprehend socio-demographic factors and target potential consumers for DR programs. Another research direction explores the development of privacy-preserving techniques and integrity assurance mechanisms for load profiling in SGs, with the goal of safeguarding sensitive smart metre data and maintaining the accuracy of outsourced data analytics processes [51].
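A minimal sketch of what a "typical pattern" can mean in practice is given below: the readings are folded into days, averaged per time slot, and normalised so that the profile describes the shape of use rather than its volume. The function name and the four-slot toy data are assumptions; real smart metre data would use 24 to 144 slots per day, and clustering of such profiles would be a separate step.

```python
def typical_daily_profile(readings, slots_per_day):
    """Average the readings per time slot across days and normalise to sum to 1."""
    if len(readings) % slots_per_day != 0:
        raise ValueError("readings must cover a whole number of days")
    n_days = len(readings) // slots_per_day
    # Mean consumption in each slot of the day, taken across all days.
    mean_per_slot = [
        sum(readings[d * slots_per_day + s] for d in range(n_days)) / n_days
        for s in range(slots_per_day)
    ]
    total = sum(mean_per_slot)
    return [m / total for m in mean_per_slot]


# Two toy days with four slots each (kWh per slot).
readings = [1.0, 2.0, 3.0, 2.0,
            1.0, 4.0, 5.0, 2.0]
profile = typical_daily_profile(readings, slots_per_day=4)
peak_slot = max(range(len(profile)), key=profile.__getitem__)
print(profile, peak_slot)
```

Normalised profiles of this kind are a common input to customer clustering, since two households with the same shape but different total consumption then fall close together.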

| Load forecasting
Load forecasting is essential in the electric power industry for operations, planning, pricing, procurement, and hedging decisions. Load forecasting can be long-term or short-term, with different use cases for each. Preprocessing techniques in load forecasting include smoothing and imputation, feature extraction and selection, and clustering [52]. Various techniques are used, such as artificial neural networks, time series analysis, bottom-up approaches, SVM, and regression.
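Before applying any of the techniques listed above, it is common to establish a simple reference forecast. The sketch below shows a seasonal-naive baseline (repeat the most recent season) scored with the mean absolute percentage error. It is a generic illustration rather than a method taken from the reviewed articles; the function names and toy values are assumptions.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast each future step with the value observed one season earlier."""
    last_season = history[-season_length:]
    return [last_season[h % season_length] for h in range(horizon)]


def mape(actual, forecast):
    """Mean absolute percentage error in percent; assumes no zero actual values."""
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Toy load series with a season of four steps (e.g. four blocks per day).
history = [10.0, 20.0, 30.0, 20.0,
           12.0, 22.0, 28.0, 20.0]
actual_next = [12.0, 20.0, 28.0, 25.0]

forecast = seasonal_naive_forecast(history, season_length=4, horizon=4)
error = mape(actual_next, forecast)
print(forecast, error)
```

A proposed model that cannot beat this baseline on a public dataset is unlikely to be useful, which is one reason a standardised evaluation protocol matters.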

| Load management
Load management can provide better and more personalised services by collecting sociodemographic information [53]. Key aspects include customer base load estimation and tariff design. Customer base load estimation evaluates the effectiveness of DR programs by estimating load profiles without the program.
The literature is categorised into similar-day methods, regression-based methods, and morning-consumption-adjustment methods [54]. New approaches using high-frequency data, such as clustering-based methods, improve performance. Tariff design, on the other hand, is essential in balancing consumer response and utility provider profits. Clustering consumers is an important first step, followed by solving optimisation problems based on each cluster's load profiles [55]. The real-time price determination problem aims to maximise profits for SG retailers [56]. Price bidding in the SG plays a crucial role in demand side management by allowing consumers to participate actively in electricity markets. By submitting price bids for electricity use, consumers can influence the market price of electricity, encouraging energy savings and peak load reduction. This interactive process not only empowers consumers but also helps stabilise the grid by aligning energy usage with real-time supply and demand conditions [57].
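To make the baseline estimation categories mentioned above more concrete, the following sketch combines a similar-day average with an additive morning adjustment. It is a simplified illustration under stated assumptions, not the exact formulation of ref. [54]: the choice of previous days, the additive form of the adjustment, and all names and values are hypothetical.

```python
def similar_day_baseline(previous_days, observed_morning, morning_slots):
    """Estimate the load a customer would have had without a DR event.

    previous_days: per-slot load lists for recent non-event days.
    observed_morning: event-day readings for the slots listed in morning_slots.
    morning_slots: slot indices that precede the event and are used for adjustment.
    """
    n_days = len(previous_days)
    n_slots = len(previous_days[0])
    # Similar-day step: per-slot average over the previous non-event days.
    baseline = [sum(day[s] for day in previous_days) / n_days
                for s in range(n_slots)]
    # Morning adjustment step: shift the baseline by the gap between the
    # observed and the averaged morning consumption (additive variant).
    baseline_morning = sum(baseline[s] for s in morning_slots) / len(morning_slots)
    observed_mean = sum(observed_morning) / len(observed_morning)
    offset = observed_mean - baseline_morning
    return [b + offset for b in baseline]


previous_days = [[1.0, 2.0, 4.0, 3.0],
                 [1.0, 2.0, 6.0, 3.0]]
adjusted = similar_day_baseline(previous_days,
                                observed_morning=[2.0, 3.0],
                                morning_slots=[0, 1])
print(adjusted)
```

The difference between this estimated baseline and the metered load during the event window is then taken as the load reduction attributed to the DR program.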

| Smart metre and system-level datasets
In this section, we present a comprehensive review of all public smart metre and/or region datasets identified in the literature to the best of our knowledge at the time of writing this article, examining their characteristics, features, associated applications, and privacy considerations. The information is summarised in Tables 4 and 5. Table 4 summarises the datasets commonly used in the literature for certain applications.

| Low Carbon London
The Low Carbon London (LCL) dataset [58] is an open dataset that involved 5567 consumers. Dynamic time-of-use tariffs were applied to 1122 of the consumers as part of an experiment carried out over the year 2013. The dataset consists of the following:
1) Energy consumption (in kWh) sampled from smart metres at a 30-min frequency for each consumer. Data were collected for a total of 12 months during the experiment (i.e. when the dynamic time-of-use tariffs were in effect), in addition to 6 months before and 2 months after the experiment period.
2) Appliance survey that includes information such as the number of appliances, physical parameters of the household (e.g. insulation, number of rooms etc.) and basic details of the occupants (e.g. number of occupants, age categories etc.). Data include 990 records from the group that opted for the dynamic time-of-use tariffs, and 1870 from the group that did not.
3) Attitudes survey to assess the change in consumption behaviour of the group that opted for the experiment, such as the factors that made them more likely to change their behaviour. Seven hundred fourteen records were received.
The privacy of consumers was preserved by doing the following:
1) Identifying information such as names, locations, and addresses was omitted.
2) ID keys were generated randomly.
3) The surveys were manually checked for any inadvertent inclusion of personal details.
Using the historical data recorded prior to the implementation of DR programs, some baseline load estimation algorithms have been developed on this dataset [54]. The high-frequency data also led to works in long-term load forecasting and STLF [71,72].

TABLE 4 Applications and corresponding public datasets.

| UMass smart*
The UMass Smart* [60] dataset is a collection consisting of the following 9 subsets:
• DeepRoof dataset: Satellite images of building roofs and the planar segmentation of each.
• Apartment dataset: Aggregated energy consumption of 114 single-family apartments for the period of 2014-2016, together with their associated weather data. Readings were sampled once per minute.
• Home dataset (2017 release): The aggregated and individual circuit consumption of 7 households collected at a per-minute interval over multiple years.
• NIOM (non-intrusive occupancy monitoring) dataset: Aggregated consumption at minute level for a 3-week period for two households of two occupants each, with the ground truth occupancy status.
• Home dataset (2013 release): This dataset focuses on depth instead of breadth. That is, only three houses were monitored. However, the data included information about consumption (per circuit and aggregate, individual metres, dimmable and non-dimmable switches), two electrical phase data (voltage and frequency), environmental data (indoor and outdoor), oven and door status, energy generation data (solar panels, wind, and battery voltage), and motion detector data. The dataset also includes a micro-grid dataset of 443 homes over a single-day period.
• Solar-TK: Contains solar energy generation data from 81 homes in the US.
• Solar panel: Includes energy generation data for 50 rooftop solar panels at a 1-min interval.
• SunDance: Includes 1 year's data of 100 solar sites in North America in 2015. Net metre, solar generation, and weather data were collected at a frequency of 1 sample per hour.
• Physical-Black-Box Model: This dataset includes weather and normalised solar generation data to build the physical black-box model. The files also include code to model shading effects.
Applications of this dataset include a privacy-preserving architecture developed in ref. [78] that still meets the utilities' needs to achieve a net metering goal. The solution uses the concept of zero-knowledge proofs and provides cryptographic guarantees for the integrity, authenticity, and accuracy of payments, while permitting changeable pricing without disclosing the power measurements acquired throughout a billing period. NILM and NIOM algorithms can be developed on this dataset because of the availability of circuit and appliance level consumption data, as well as the ground truth for detecting occupancy status [79]. Solar panel data can also be used as auxiliary data for distribution grid management algorithms such as voltage regulation, as in ref. [80]. The high-frequency polling of data per minute also prompted some researchers to study data compression algorithms, such as the work done in ref. [81].

| Ausgrid distribution network: residential and substations
The Ausgrid distribution network records and publishes four types of datasets [61]:
• Electricity consumption: Ausgrid has grouped the yearly residential and non-residential electricity consumption data by local government area (LGA), 32 areas in total, in its distribution system. High-voltage customers and supply services such as public lighting and bus shelters are not included in these data.
• Solar panels and electricity consumption: A sample of 300 solar customers from Ausgrid's electricity network area was randomly selected, all of whom were billed on the domestic tariff and possessed a gross metered solar system throughout the period from 1 July 2010 to 30 June 2013. To compile the data, metre reading processes were employed to obtain a comprehensive dataset of actual electricity consumption and production at half-hour intervals for the selected customers during the specified period. Customers who fell at the extremes of household consumption and solar generation performance during the first year of the study were excluded. Solar homes with rooftop solar systems connected to the grid through a gross metering configuration account for 2657 of the monthly
Energy management solutions such as storage and DER scheduling and customer baseline load estimation can be implemented on the solar panels and electricity consumption (residential) dataset. Electricity consumption, solar panel generation, and net demand forecasting can also be implemented on this dataset. For the Ausgrid substation and past outages datasets, aggregated load forecasting, demand side management (e.g. planning of charging infrastructure [82] and electrical distribution system planning), and modelling equipment failure (e.g. power transformer failures and retirement statistics [83]) are the most common applications. Customers in these datasets have been deidentified and do not represent a statistically significant sample of residential customers in the Ausgrid network area, nor have they been subjected to detailed occupancy checks.

| Customer behaviour trials
The Customer behaviour trials (CBT) dataset [62] consists of the electricity consumption data of 5375 households recorded every half an hour over a span of 18 months. The data was collected by the Commission for Energy Regulation of Ireland. The objective of the CBT dataset is to evaluate smart-metre technology, time-of-use tariffs and different demand side strategies. Therefore, the data was divided into two phases: the benchmark period (6 months) and the test period (12 months). In the trial, four different groups were assigned different time-of-use tariffs. A survey (of 143 questions) on household characteristics is also included. The survey aims to depict the socio-demographic characteristics of the household: employment status, household size, age, and social class. Given that consumers were incentivised to change their behaviours through demand-side strategies, the authors believe that the dataset is a good benchmark for developing concept-drift-aware algorithms. As reported in the study, 82% reported making some changes in their consumption patterns and 74% reported drastic changes in their households. The trial reported noticeable drastic changes in 38% of the consumers.

| The state grid corporation of China
SGCC [70] released the daily electricity consumption of 42,372 consumers in the period from 1 January 2014 to 31 October 2016 for a total of 1035 days (with a consumption reading per day). The dataset is also labelled for malicious activities for a total of 3615 thieves. The other 38,757 consumers are labelled as honest consumers. Labelling electricity theft acts as the ground truth to evaluate models.
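The figures quoted above are internally consistent, which can be checked with a few lines of arithmetic. The sketch also derives the share of consumers labelled as thieves, a number not stated in the text but useful because it indicates the class imbalance that a theft classifier has to handle. Variable names are illustrative.

```python
from datetime import date

# Recording period, counting both end dates.
start, end = date(2014, 1, 1), date(2016, 10, 31)
n_days = (end - start).days + 1

total_consumers = 42_372
thieves = 3_615
honest = total_consumers - thieves
theft_share = thieves / total_consumers

print(n_days)                 # number of daily readings per consumer
print(honest)                 # consumers labelled as honest
print(f"{theft_share:.1%}")   # fraction of consumers labelled as thieves
```

With roughly one labelled thief for every eleven honest consumers, plain accuracy is a weak metric on this dataset; imbalance-aware metrics such as precision, recall, or the area under the ROC curve are generally more informative.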

| Independent system operator New England
Every month since 2003, the independent system operator (ISO) New England has published [68] system-level hourly load data, as well as corresponding temperature data, regional location prices, market clearing prices and interchanges with other power systems for 9 different zones. The market data allows for studies in power market design [84], price bidding [85], as well as price forecasting.

| Australian energy market operator
The Australian Energy Market Operator (AEMO) [63] serves as the main entity responsible for overseeing the management and operation of electricity and gas networks, as well as price determination, in five states in Australia. This organisation maintains a comprehensive dataset that includes aggregated demand data and electricity price data for these states, with temporal granularity provided at a half-hourly rate. However, it should be noted that beginning in November 2021, the resolution of these data was increased to one data point every 5 min. The data have been available and regularly updated since 1998. The research done on the datasets focused mostly on STLF. However, some descriptive analysis work was also done. For example, in ref. [86], the effects of wind and solar panel generation on wholesale electricity prices were studied. The authors in ref. [87] used the dataset to design the optimal battery capacity of solar panels. The authors also simulated the hourly generated solar panel power and made it public [88].

| Electric reliability council of Texas
The Electric Reliability Council of Texas (ERCOT) [64] is an ISO responsible for overseeing the state's electrical transmission and distribution network, serving over 25 million customers. Since its inception in 2001, ERCOT has managed the deregulated wholesale electricity market and has provided various datasets to the public, including real-time and day-ahead market data, transmission and generation data, and renewable energy data. These datasets encompass energy prices, demand, and generation capacity for the entire ERCOT region, divided into four load zones. Access to ERCOT datasets requires a submission request through their website. These datasets have been utilised for various purposes, such as STLF and price forecasting (e.g. ref. [89]).

| Global energy forecasting competition 2012 (GEFCom2012)
The Global Energy Forecasting Competition 2012 (GEFCom2012) [67] is a hierarchical load forecasting contest for a utility located in the United States with 20 zones (from 1 July 2003 to 30 June 2008). The dataset includes temperature data from 11 weather stations and the holidays of that time period. The authors in ref. [67] reviewed the winning solutions.

| The building data genome project
The Building Data Genome Project [65] consists of whole-building electrical metre data for 507 non-residential buildings from February 2014 to April 2016, most of which are on university campuses. The dataset also includes distinctive metadata such as gross floor size, primary use type, and meteorological information. The dataset was developed primarily to test various algorithms and feature extraction techniques. Such use cases include load forecasting, load shape/profile clustering, synthetic load data creation [90] and inference of buildings' characteristics [91].

| Energy market authority of Singapore
The Energy Market Authority of Singapore publishes numerous statistics pertaining to the grid operation [66], notably the system demand data polled at 30-min intervals since the beginning of 2004 and historical market prices. The datasets were used for reliability analysis [92,93], descriptive analysis (for example, analysis of customers responding to socioeconomic determinants [94]) and demand-side bidding [95].

| EnerNOC
EnerNOC [44] collected 5-min energy consumption data for 2012 for 100 commercial/industrial sites. EnerNOC is the only dataset that we are aware of that has labelled anomaly data, which can be used as the ground truth for developing anomaly detection algorithms. However, further examination revealed that the number of anomalies in the dataset is arguably negligible (no more than 11 instances out of more than 100,000 readings). For privacy, the real measurement values and identifying information such as geolocations and floor area have been anonymised. However, the values were shifted on a linear scale to ensure consistency of the comparison over time and across sites. The dataset has been used in the design of energy storage systems [96,97] and load forecasting [98].

| UCI ElectricityLoadDiagrams20112014
ElectricityLoadDiagrams20112014 is a real-world dataset from Portugal [69]. The dataset has a resolution of 4 readings per hour from 2011 to 2014 for 370 customers. The dataset includes residential and commercial buildings and consumers.

| Discussion
While datasets from the UK, US, Australia, Ireland, and Portugal provide valuable insights into energy consumption, the majority originate from developed nations in the Northern Hemisphere. This overrepresentation may limit the applicability of research outcomes to the diverse energy landscapes of developing countries, particularly those in the Southern Hemisphere, where different economic, infrastructural, and climatic conditions prevail. The geographic concentration of these datasets suggests potential limitations in the generalisability of research findings. The distinct energy consumption patterns, regulatory frameworks, and customer behaviours specific to the United States may not be directly transferable to other global contexts. This limitation underscores the need for a more diverse compilation of datasets that encapsulate the variegated nature of energy systems across different regions and cultures to truly harness the universal applicability of smart grid analytics. The issue is particularly relevant in developing countries, where infrastructural, economic, and policy differences shape distinct energy dynamics. Emerging markets often prioritise expanding energy access, diverging from patterns found in more developed nations. Consequently, a more inclusive dataset collection is imperative for globally relevant smart grid analytics.
Additionally, the analysis of dataset utilisation across different applications, as demonstrated in Figure 2, reveals a pronounced emphasis on load forecasting, particularly within system-level datasets that benefit from frequent updates. This trend aligns with the critical role that load forecasting plays in the operational planning and reliability of the electrical grid. Load forecasting's predominance in the literature is indicative of its foundational importance in grid management and the value placed on accurate and timely predictions.
Furthermore, the synthesis of the dataset characteristics and their respective applications into a coherent framework presents an opportunity for a targeted approach to dataset utilisation. Table 6, which delineates the most popular datasets for each application, serves as a practical guide for researchers and practitioners in the field. By identifying the datasets most suited to specific applications, this summary aids in the efficient allocation of analytical efforts and resources.
In conclusion, the analysis of smart metre and system-level datasets highlights the centrality of certain applications in smart grid analytics and the geographic concentration of dataset origins. The field stands to benefit from an expansion of data sources that better represent the global diversity of energy systems and from leveraging the specialised utility of each dataset. This dual approach can enhance both the breadth and depth of insights in smart grid analytics, fostering advancements that are both innovative and inclusive.

| DETAILED IN-HOME CONSUMPTION DATA
NILM systems provide an efficient way to monitor multiple appliances without the need for submonitoring, hence the name NILM. This section focuses on datasets that enable such systems, which are commonly referred to as buildings' datasets [99] or NILM datasets [15]. NILM datasets contain data from electrical measurements taken at a very high sampling rate at the plug load, individual circuits in the house, and/or the main line. Data may also include environmental measurements (e.g. temperature), auxiliary data, and information about events such as occupancy status (i.e. how many occupants are inside at any given time) and switches. The availability of labelled power events allows for event-based approaches in energy disaggregation, in contrast to event-less approaches when power events are not labelled. The datasets might also include information on the weather both inside and outside the building. This high frequency is different from smart metre datasets, which typically take measurements every 10-30 min, with most commercial smart metres sampling at less than 1 Hz. Gao et al. [100] suggested a 4 kHz threshold for a feasible and reliable classification of appliances in energy disaggregation.
Higher sampling frequencies of the electrical measurements enable features such as transient information, voltage-

[Figure: The count of articles that utilised the public datasets for particular applications.]

TABLE 6 The most common public dataset for each application.

| Application | Most common datasets |
| --- | --- |
| Load forecasting | ISO New England, ElectricityLoadDiagrams20112014, and Australian Energy Market Operator |
| Load profiling | The Building Data Genome Project, Pecan Street, and LCL |

If the aggregate metering of electricity consumption is not available and only the measurements of individual appliances are available, the data lack a ground truth for evaluating and testing energy disaggregation models. Therefore, the dataset is used only for training, while the evaluation is performed on other datasets. The naive method of summing all the monitored appliances does not serve as a ground truth, since most of the appliances in the house are not monitored. If the appliance events are labelled, then such datasets might be used for event classification.
This section discusses the most popular applications of NILM datasets and the most popular public NILM datasets.

| Detailed in-home consumption data applications
NILM datasets are primarily used for developing algorithms to disaggregate total consumption into individual appliances. The output of energy disaggregation systems can be used for purposes such as reducing energy consumption, preventing appliance failures, forecasting SG consumption peaks, and monitoring daily living activities [101].

| Energy disaggregation
The energy disaggregation process typically involves three stages: event detection, feature extraction, and load identification [102]. Event detection captures appliance state transitions, while feature extraction uses steady-state, transient, and non-traditional event detection approaches to extract relevant features [103].
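As an illustration of the event detection stage, the following minimal sketch flags appliance state transitions as step changes in the aggregate power signal. It is not the method of refs. [102,103]; the function name `detect_events`, the 50 W threshold, and the toy readings are assumptions chosen for the example.

```python
from typing import List, Tuple


def detect_events(power_w: List[float],
                  threshold_w: float = 50.0) -> List[Tuple[int, float]]:
    """Return (sample index, power step) for each step change whose
    magnitude reaches the threshold (hypothetical value in watts)."""
    events = []
    for i in range(1, len(power_w)):
        step = power_w[i] - power_w[i - 1]
        if abs(step) >= threshold_w:
            events.append((i, step))
    return events


# Toy aggregate signal: an appliance of roughly 1.5 kW switches on, then off.
aggregate = [100.0, 102.0, 101.0, 1601.0, 1600.0, 1598.0, 99.0, 100.0]
events = detect_events(aggregate)
for index, step in events:
    print(index, "on" if step > 0 else "off", abs(step))
```

A positive step suggests a switch-on event and a negative step a switch-off event; the later stages would then extract features around each event and match them to an appliance.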

| Energy management system
EMS combines hardware and software to monitor and control energy consumption and generation within a home, helping consumers save on utility bills while maintaining comfort levels [104,105]. DR solutions incentivise customers to actively control their energy demand based on market prices [106,107].

| Condition-based maintenance
Condition-based maintenance monitors equipment conditions and performs maintenance tasks based on equipment status, allowing early detection of minor failures and more efficient maintenance strategies.

| Ambient assisted living
AAL focuses on products and services that improve the lives of elderly adults and promote their physical independence. NILM systems can facilitate AAL without the need for obtrusive monitoring [108].

| Appliance anomaly detection
Detecting anomalous appliances using NILM techniques is more cost-effective and practical than using individualised metres per appliance [109,110]. However, further development is needed to improve the effectiveness of NILM-based anomaly detection [111,112].

| Detailed in-home consumption public datasets
The NILM datasets have been extensively reviewed in the literature. For example, the authors in ref. [113] provided a comprehensive review of 29 existing open datasets in terms of settings (residential or otherwise), measurement level (whole premises, individual appliances, and/or individual circuits), electrical and auxiliary measurements, time period, event label availability, and file format. The authors in ref. [16] reviewed 22 open datasets, providing the country, the number of households/sites in the dataset, and the sampling rate. In the work of ref. [15], 26 datasets were reviewed that provide the same information as the work of ref. [113], in addition to the country of origin. A critical review of NILM datasets was published in ref. [14] in 2021, in which 42 datasets were comprehensively reviewed. The datasets were divided into high-frequency, low-frequency, and synthetic datasets, and the review provides the same characteristics mentioned in refs. [15,16,113] in addition to the name and number of appliances measured.
Table 7 reviews 24 NILM datasets with respect to their measurement levels and frequency, measured quantities and sampling rate, and the applications for the datasets.

| GRID DATA
TABLE 7 Measurement levels and frequency, measured quantities and the sampling rate, and the applications for 24 NILM datasets.

Electrical grid data prove invaluable for examining typical grid operating conditions and analysing grid behaviour during failures and disturbances. Furthermore, they facilitate the investigation of microgrids in islanding conditions, where the microgrid is disconnected from the main grid, as well as the integration of renewable energy sources. The electrical grid encompasses power generation, transmission, and distribution components, and grid data in the literature enable the emulation of electrical measurements and sensors using various tools. It is worth noting that researchers often employ interchangeable terms when referring to grid datasets, such as network, case, system, and grid. There are several terms for grid datasets that are not always used consistently in the literature, due to a lack of standardisation [17]:

• Test systems: A simple grid built for the purpose of demonstrating a single problem or performing basic validation or testing. Synthetic [138,139] or real grids [140,141] are named test systems. IEEE case 9 [142] and ICPSs [143-146] are examples of test systems.
• Benchmark grids: Grids where the aim is to compare and evaluate different algorithms. For example, the CIGRE systems [147] and the authors of ref. [148] presented benchmark systems. However, it is worth noting that the IEEE test cases are typically used as a benchmark (e.g. for power flow analysis), which highlights the issue of the interchangeable use of grid terms.
• Representative grids: Grids that represent real grids and/or a set of grids that share similar characteristics (e.g. rural grids). Such grids bridge the gap between technical findings and real-world grids [149].
• Generic grids: The work in refs. [150,151] used the term generic to refer to a grid where different parameters can be tweaked to generate various grids. However, the term was synonymous with representative grid in the work of ref. [152].
• Synthetic grids: Grids that are neither models of real grids nor derived from a real-world grid.
This section discusses the most popular applications and public datasets for grid data.

| Applications
Grid datasets are used for various applications such as planning, stability analysis, reliability analysis, state estimation, and power flow analysis [153]. The SG paradigm has expanded research opportunities in the effective integration of DER and storage devices within the power grid, focusing on assessing the impact of incorporating these elements and evaluating their potential to reduce generation costs, smooth power generation curves, and maintain sustainable service reliability for users [154].

| Planning
Power system planning faces challenges such as generation expansion planning (GEP) and transmission expansion planning (TEP), which involve determining the ideal combination of technology, location, and building time for new generation units and power lines [155]. Both GEP and TEP are formulated as optimisation problems with constraints such as the electricity market, congestion, uncertainties, and other considerations [156-163].

| State estimation
State estimation determines the state of the power grid from imperfect measurements. It is used for online applications such as security analysis, anomaly detection, and fault diagnosis, or for offline purposes such as planning [164]. With the advent of the SG, state estimation is increasingly important for distribution grids [165].

| Power flow analysis
Power flow analysis examines the flow of power in a networked system, analysing steady-state operations of power systems and optimising power flow for efficiency [166].
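To make the idea concrete, the sketch below solves a steady-state power flow on a toy three-bus network using the DC (linearised) approximation, which is a common simplification and not a full AC power flow. The network, susceptances, injections, and function names are hypothetical and are not taken from any of the test systems discussed in this paper.

```python
from typing import Dict, List, Tuple


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def dc_power_flow(n_bus: int,
                  lines: Dict[Tuple[int, int], float],
                  injections: List[float],
                  slack: int = 0):
    """lines maps (from_bus, to_bus) to susceptance in p.u.;
    injections holds the net power injected at each bus in p.u."""
    keep = [i for i in range(n_bus) if i != slack]
    pos = {bus: k for k, bus in enumerate(keep)}
    # Reduced susceptance matrix (slack row and column removed).
    b_red = [[0.0] * len(keep) for _ in keep]
    for (i, j), b in lines.items():
        for u, v in ((i, j), (j, i)):
            if u != slack:
                b_red[pos[u]][pos[u]] += b
                if v != slack:
                    b_red[pos[u]][pos[v]] -= b
    theta_red = solve_linear(b_red, [injections[i] for i in keep])
    theta = [0.0] * n_bus  # the slack bus angle is the reference (0 rad)
    for bus, k in pos.items():
        theta[bus] = theta_red[k]
    flows = {(i, j): b * (theta[i] - theta[j]) for (i, j), b in lines.items()}
    return theta, flows


# Toy case: bus 0 is the slack, bus 1 is a load, bus 2 is a small generator.
lines = {(0, 1): 10.0, (0, 2): 5.0, (1, 2): 4.0}
injections = [0.5, -1.0, 0.5]
theta, flows = dc_power_flow(3, lines, injections)
for line, flow in flows.items():
    print(line, round(flow, 4))
```

The DC approximation ignores losses, reactive power, and voltage magnitudes, so it is suited only to quick screening; the test systems discussed later are normally solved with full AC power flow tools.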

| Reliability and stability analysis
Reliability analysis studies the life cycle of components and the system level, while stability studies examine the steady state and transient stability of power grids [167].

| Transmission and distribution grids
The transmission grid is responsible for delivering the load over long distances from a generating site to electrical substations, while the distribution grid is responsible for delivering energy to consumers. The authors of refs. [168,169] classify the data collected from the grid into:
• Standard equipment (e.g. transformers, switch gears, circuit breakers, storage batteries, transmission cables, and cables)
• Technical parameters (e.g. transformer and capacitor ratings, voltage levels, and number of buses)
• Cost and maintenance data
• GIS data of the power lines, service points, and buildings
• Substations data and locations
• Parcel use category (e.g. residential)
Several works have reviewed available grid datasets. The work in ref. [17] reviewed steady-state distribution grid datasets, highlighting the intended use case. The authors in ref. [153] reviewed the IEEE and CIGRE benchmark test systems, highlighting the applications done on each. A review of distribution test systems in the United States is presented in ref. [18]. The authors analysed IEEE test systems, Pacific Northwest National Laboratory test systems, Electric Power Research Institute representative systems, and the Pacific Gas and Electric Company (PG&E) grids. The IEEE PES Working Group on Cascading Failure [170] provides a comprehensive review of test systems, including the intended use case and technical details of the test grids. This section provides a concise summary of the most popular grid datasets alongside their intended use cases and popular applications.
The Test Feeder Working Group originally released five test feeders: IEEE 4, 13, 34, 37, and 123 bus test feeders. Test feeders are synonymous with test systems, with the exception that test feeders have only one power source while test systems incorporate multiple power sources. They were intended to benchmark power flow algorithms; however, various analyses and research have since been conducted on the five original test feeders [171]. The test feeders are small to medium radial feeders and are not representative of large and complex distribution grids. In 2010, a sixth test feeder, called the IEEE Comprehensive Test Feeder, was added to model various components of the grid and transformers in particular [172]. The feeders comprise overhead lines and underground cables, voltage regulators, shunt capacitors, and various degrees of load unbalance [171]. Table 8 summarises the intended use cases of the original feeders and other prominent applications in the datasets.
Since then, several benchmark test systems have been made public to serve as standardised datasets to test various methods and algorithms [186]. The IEEE 9, 14, 30, 39, 57, 118, and 300 bus systems and the Reliability Test Systems (RTS) RTS-24 and RTS-73 all allow for power flow, state estimation, and planning studies. However, only IEEE RTS-24/73 allows for reliability analysis, and only IEEE 39 for stability analysis and development of control schemes. Modifying the test systems to allow for different analyses is possible [153].
In 2010, an 8500-bus test feeder was published to represent a full-size distribution system [187], still allowing for the same intended use cases in Table 8. The test system was also used in time series load modelling [188] and DER integration in the SG [189].
Three test feeders and systems were published to tackle specific scenarios and to challenge common assumptions. Table 9 summarises the test feeders and systems with the intended use case and common applications.
Texas A&M University hosts several datasets on its website [198] for electric grid test cases that cover a variety of systems and scenarios and are crucial in different power system analyses. These datasets do not contain Critical Energy Infrastructure Information (CEII), making them widely accessible for research purposes.
Among the datasets are the latest synthetic electric grid cases of 2023, which include a smaller self-contained island test case for the Hawaiian island of Oahu with a synthetic 138/69 kV transmission network. For larger-scale scenarios, there are datasets such as the Texas Synthetic Grid, which covers the ERCOT portion of Texas with a 6717-bus transmission network, and the Combined East-West US Grid, representing a synchronously intertied model of the US portion of the eastern and western interconnects. Datasets also exist for the ARPA-E Performance-based Energy Resource Feedback, Optimisation, and Risk Management (PERFORM) program. This program aims to optimise grid management as the penetration of variable renewable resources continues to increase. For this program, specific cases such as the 6717-bus Texas Case and 24,000-bus Midwest Case were created.

TABLE 8 The original IEEE test systems and their respective intended use case and common applications.

| Test feeder | Intended use case | Common applications |
| --- | --- | --- |
| 34-bus | A test system that requires voltage regulators to comply with ANSI voltage standards. | Optimal distributed generator placement [178] and optimal placement of storage systems [179]. |
| 37-bus | Capability of software to solve for the less common three-wire delta systems. | Power flow analysis with DER [180], distributed generators for providing reactive power [181], and micro-grid small signal analysis [182]. |
| 123-bus | Minimising voltage drops with voltage regulators and shunt capacitor. | Power flow analysis in unbalanced systems, operational planning for self-healing action [183], and stochastic reactive power management in microgrids with renewable energy [184]. |
| CTF | Capability of software to solve for a variety of components in one system. | Distributed generation applications [185]. |

TABLE 9 Test feeders and systems, highlighted characteristic, and the intended use case and common applications.

| Test feeder or system | Highlighted characteristic | Intended use case | Common applications |
| --- | --- | --- | --- |
| Neutral-earth-voltage test feeder | The neutral conductor is not reduced by Kron reduction [190] because the neutral voltage is above zero. | Study neutral voltages in case of connection failures. | Harmonic analysis [191] and load modelling [192]. |
| Low voltage network test system | A low voltage highly meshed system that represents typical urban areas. The system is also referred to as 342-bus LVNTS. | Tests software capability to handle highly meshed systems. | Economic dispatch with DER integration [193] and planning of communications systems [194]. |
| European low voltage test feeder [195] | Represents a typical feeder in Europe and the first feeder to operate at 50 Hz. | Tests software capability to solve for various test feeders. | State estimation with DER integration [196] and optimal sizing and placement of renewable energy batteries [197]. |
Additional datasets from 2021 to 2023 include synthetic transmission and distribution test cases, such as the Full Texas Synthetic Transmission and Distribution Test Case, and the 150-bus Synthetic Transmission and Distribution Test Case based on Travis County, Texas. These datasets also include restoration data and scenarios, as well as an associated natural gas pipeline network.
Other notable datasets come from the ARPA-E GridData program, which offers synthetic electric grid models. These models are designed to be statistically and functionally similar to actual electric grids, ensuring the confidentiality of CEII. Examples include a 200-bus synthetic grid on the footprint of Central Illinois, a 500-bus synthetic grid on the footprint of South Carolina, and a 2000-bus synthetic grid on the footprint of Texas, among others.
Datasets for competitions like the GO Competition Challenge 1, and literature-based power flow test cases such as IEEE bus systems and the Kundur Two-Area System, are also included. Moreover, for stability analysis and control, the dataset provides small signal stability test cases, such as the Three Machines Infinite Bus Benchmark System, the Brazilian Seven Bus System, and the New England 68-Bus Test System.
All of these cases include feasible AC power flow solutions, and some have additional parameters or models for analyses such as transient stability, geomagnetic disturbance analysis, energy economic study, and more. They have been developed to improve the situational awareness of current system operating conditions and to support various studies and research in power systems.

| Power generation
Electricity generation can be divided into two categories: centralised and distributed generation. Centralised generation refers to the generation of electricity through large-scale production plants and the distribution of that electricity to consumers, whereas distributed generation refers to the generation of electricity on a much smaller scale, typically by individuals using renewable energy sources.
In both centralised and distributed generation, the data collected is identical. It includes load demand, historical power measurements, capacity, generating unit, cost, performance, ramp-rate limit, operating zones, and carbon dioxide emissions data. These data are used in the management of both the power generation side and the microgrid side. Power generators are modelled on both transmission and distribution grids (in the case of DER).

| DISCUSSION
After evaluating a comprehensive number of the most popular public datasets and the work done on them, several aspects of the discussion were identified. This section covers these aspects, as well as research gaps and future research directions. For example, for data availability and synthetic data, we identified that the generation of privacy-preserving synthetic data could solve the problem of data availability, allowing realistic data analysis on otherwise private data [199]. In terms of privacy preservation, we have identified two main categories of techniques, which are either 'consumer-oriented' [200] or 'utility-oriented' [201]. Analysing the impact of 'consumer-oriented' privacy techniques on the utility of the datasets is an interesting direction for future work. Moreover, to the best of our knowledge, there is no work that aims at identifying consumers that practise such privacy-preserving techniques. Regarding data quality, since the number of popular public datasets is fairly low, an interesting research direction is to develop at this early stage a toolkit to unify consumer datasets in terms of format, data exploration, preprocessing techniques, and feature engineering techniques, similar to the toolkit developed for NILM datasets [202].
Moreover, after analysing the public datasets and the literature, we noticed two relevant and prominent issues:
1. Energy theft detection (and more broadly anomaly detection) and EV detection and load forecasting predominantly rely on synthetic or private datasets. The preference for non-public datasets is largely due to privacy concerns and the proprietary nature of the data. In energy theft detection, the data involves sensitive user information and operational details from utility companies, which are legally protected and competitively sensitive. Similarly, for EV forecasting, private companies hold detailed charging data that is commercially valuable and often kept confidential.
2. Newly emerging challenges, such as the detection of unauthorised crypto mining, suffer from a lack of public datasets. The surge in cryptocurrency mining poses challenges to smart grid management. Unauthorised mining operations and rapid technological advancements in this field hinder the collection of accurate and up-to-date datasets. One study estimates the energy consumption to be between 120 and 240 billion kilowatt-hours yearly [203]. This level of consumption suggests a significant impact on grid resources, yet the lack of detailed data impedes comprehensive analysis and grid optimisation efforts. Notably, during Texas's energy conservation periods, such as the 2022 summer heatwave, mining activities demonstrated demand flexibility [204]. This behaviour indicates a potential adaptive load management strategy, but a detailed dataset is critical for evaluating the feasibility and reliability of such an approach.
These trends highlight a gap in available resources for researchers, emphasising the need for a collaborative effort to establish data-sharing protocols that can balance privacy, commercial value, and research needs to support advancements in smart grid technologies.

| Data availability and synthetic data
One of the major issues facing SG data analytics is the lack of public datasets available, which can be attributed to the reluctance of energy providers to publish their data. Privacy, security, and political issues all contribute to this issue [205]. Aside from the privacy concerns posed by energy disaggregation discussed in previous sections, the geographical location of consumers can be compromised by solar panel generation data, as in ref. [206]. The lack of data availability and of a standard benchmark is evident in the findings of a 2019 systematic mapping study of 358 articles in SG data analytics [207]. Their findings revealed that 70% of the articles were conducted on private datasets, 26% used publicly available datasets, 15% synthetically generated the data, and the remaining 4% used a combination of public and private datasets. Without a standardised large set of public datasets, the issue of reproducibility is expected to persist. As a result, there has been an interest in developing sophisticated techniques to synthesise SG data, and, in particular, energy consumption data, either at an aggregate level or appliance level (i.e. the case of NILM data). There is less focus on synthesising other categories of data such as market data, since these types of data are abundant and made public by grid operators because they are necessary information for ISO and consumers. Grid data, on the other hand, is mostly synthetic since it is considered critical information for grid operators. Synthetic data generation, especially using data-driven approaches, also gives rise to opportunities for grid operators to allow realistic data analytics without sacrificing their customers' privacy. GANs were first introduced to generate synthetic data in the work of ref. [208] in 2018, and since then several other works have utilised GANs to generate time series data [90, 209-211]. The results of these efforts suggest that GANs are a promising research direction. However, simply using GANs is not enough to preserve privacy, as they are susceptible to membership inference attacks [199].

| Privacy and security
With higher sampling rate readings, the analysis of smart metre data on energy consumption patterns can be used to determine household occupancy and other more detailed sensitive information about the household. The serious nature of the privacy issues that smart metres raise has been shown to be a barrier to the widespread implementation of smart metres in some countries [212-214].
The work in ref. [215] reviewed the existing literature on smart metre privacy and categorised the techniques into two broad categories:
• Data manipulation: In this category, the high-resolution data is manipulated at the consumer's end before being communicated. Data aggregation, quantisation, and differential privacy techniques [37, 216-218] all fall into this category. For example, the effect of data granularity on privacy was studied in ref. [219]. However, more sophisticated privacy-aware techniques are required to ensure the aggregation of private data [220]. Secure multi-party computation coupled with homomorphic encryption [221], and secret sharing [222] are considered powerful candidates to achieve privacy-aware data aggregation.
• Demand shaping and scheduling: In this category, smart metre values are not modified or obfuscated. Instead, batteries, appliance scheduling, and renewable sources hide energy usage within the house and hinder privacy-intrusive attacks, such as NILM. In these cases, smart metres measure perturbed usage after using the battery and renewable sources. As such, if locally installed batteries and renewable sources could provide the total household demand, privacy would be fully ensured.
Table 10 illustrates the four main categories of demand shaping and load scheduling and exemplary articles.
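As a hedged illustration of the data manipulation category, the sketch below applies the Laplace mechanism, a standard differential privacy technique, to the sum of household readings before release. It is not the scheme of any cited work; the function names, the 4 kWh clipping bound, and the privacy budget `epsilon` are assumptions made for the example.

```python
import random
from typing import List


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_sum(readings_kwh: List[float],
                max_reading_kwh: float,
                epsilon: float,
                rng: random.Random) -> float:
    """Release the sum of readings with epsilon-differential privacy.

    Each reading is clipped to [0, max_reading_kwh], so adding or removing
    one household changes the sum by at most max_reading_kwh (the sensitivity).
    """
    clipped = [min(max(r, 0.0), max_reading_kwh) for r in readings_kwh]
    scale = max_reading_kwh / epsilon
    return sum(clipped) + laplace_noise(scale, rng)


# Hypothetical half-hourly readings from four households (kWh).
readings = [1.2, 0.4, 3.0, 0.9]
noisy_total = private_sum(readings, max_reading_kwh=4.0, epsilon=1.0,
                          rng=random.Random(0))
print(round(noisy_total, 3))
```

A smaller `epsilon` gives stronger privacy but a noisier total, which is the privacy and utility trade-off that any evaluation on consumption datasets would need to quantify.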
Security is another critical issue in the SG. Recently published work [233] provides a comprehensive review of AMI security vulnerabilities in the SG across three layers: hardware, data, and communication. The identified countermeasures fall into three main categories:
• Data encryption: Encryption is critical to preserving confidentiality and privacy at the data layer. The techniques here focus on encrypting the data before communicating them to the utility with minimal computational and communication overhead [234,235].
• Authentication mechanisms: Authentication is critical to verify the sources of messages in the SG and to prevent impersonation attacks [236,237].
• Intrusion detection systems (IDS): IDSs are a critical second line of defence for detecting security breaches in critical infrastructure. Recent works in IDS for AMI include [238-240].
For data encryption and authentication mechanisms, the work is typically evaluated using simulations on any energy consumption dataset to measure the computational and communication overheads. On the other hand, IDSs are evaluated on popular datasets that are not specific to the SG. A less common solution is to develop testbeds and simulations, such as in ref. [240]. Developing an IDS dataset in the context of the SG, or evaluating the effectiveness of IDSs trained on typical IDS datasets in the context of the SG, is a necessary research direction.
On the basis of the above, we argue that more focus should be put on understanding the impact that demand shaping and load scheduling approaches to preserving privacy have on the electrical utility. From a management perspective, these techniques might, for example, induce uncertainties similar to NTL, leading to poor utilisation of resources and poor tariff design [241]. From a data analytics perspective, such techniques could potentially disrupt the efficacy of load forecasting or energy theft detection models. Another research direction is to consider techniques that identify consumers that adopt such privacy-preserving practices, to limit their possible problematic impact on energy management and data analytics.

| Data quality
In the SG context, missing values, outliers, and noisy data (i.e. logical errors or inconsistent data) are the three most common data quality issues [242]. Several solutions to each of these problems were suggested by existing work.
Regarding missing values, most datasets do not report missing data, forcing data analysts to manually detect and manage them. For time-series forecasting applications, data replacement (also called imputation) is typically required to preserve the integrity and pattern of the data. In general, the approaches to replace missing data are categorised as interpolation-based and prediction-model-based algorithms. The former are used for a few missing data points, while the latter are used for longer periods. However, since accurate time series data are necessary to train forecasting models, researchers mostly opt to omit a certain portion or timeframe in the data (e.g. the whole day or similar omission criteria). Although this is a common and straightforward way to deal with missing data, it omits a portion of the available data, which may lead to bias in common statistical analysis (e.g. linear regression) [243]. The work in ref. [244] outlines an industry-recognised recommended practice for imputing faulty or missing smart metre data. Periods shorter than 2 h are often imputed by linear interpolation between the adjacent data. For periods longer than 2 h, the standard technique is to develop daily load profiles based on previously verified historical data of 'like weekdays' and 'like days'. Holidays or other exceptional cases are often addressed individually. It is important to note that dealing with missing values is not always necessary. For example, in ref. [244], when creating a representative load pattern of a cluster of consumers, the average of the available data points at a given timestamp is taken.
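The two-tier imputation practice described above can be sketched as follows. This is a simplified illustration and not the procedure of ref. [244]: the function name `impute`, the parameters `max_interp_gap` and `period`, and the toy data are assumptions, and the 'like day' step here simply averages every observed value at the same slot of other periods rather than only previously verified like weekdays.

```python
from typing import List, Optional


def impute(values: List[Optional[float]],
           max_interp_gap: int,
           period: int) -> List[Optional[float]]:
    """Fill short gaps by linear interpolation and longer gaps with the
    average of observed values at the same slot in other periods."""
    filled = list(values)
    n = len(values)
    i = 0
    while i < n:
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < n and values[i] is None:
            i += 1
        end = i  # the gap covers indices start .. end - 1
        gap = end - start
        if gap <= max_interp_gap and start > 0 and end < n:
            left, right = values[start - 1], values[end]
            for k in range(start, end):
                frac = (k - start + 1) / (gap + 1)
                filled[k] = left + frac * (right - left)
        else:
            for k in range(start, end):
                peers = [values[j] for j in range(k % period, n, period)
                         if values[j] is not None]
                if peers:  # stays None when no like period was observed
                    filled[k] = sum(peers) / len(peers)
    return filled


# Toy series with a "day" of 4 slots; None marks a missing reading.
series = [1.0, 2.0, 3.0, 4.0,
          1.0, None, 3.0, 4.0,
          None, None, None, 4.0,
          1.0, 2.0, 3.0, 4.0]
filled = impute(series, max_interp_gap=1, period=4)
print(filled)
```

With half-hourly data, the 2 h rule would correspond to `max_interp_gap=3`, and `period=48 * 7` would compare each slot with the same weekday and time in other weeks.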
In outlier detection (anomalous data detection), most works utilise the two-standard-deviations rule as a preprocessing step for their respective application. According to ref. [245], there are two types of outliers that should be taken into account when dealing with time series data: isolated anomalies or events, where the error is local to a certain set of data points, and innovative anomalies, where the errors are propagated throughout the time series in the system.
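A minimal sketch of the two-standard-deviations rule is given below, assuming it is applied to the whole series at once using the population standard deviation; the function name and the sample readings are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def two_sigma_outliers(values: List[float]) -> List[int]:
    """Return the indices of values lying more than two standard
    deviations away from the mean of the series."""
    mu = mean(values)
    sigma = pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > 2 * sigma]


# Hypothetical hourly consumption (kWh) with one spurious spike.
readings = [1.0, 1.2, 0.9, 1.1, 1.0, 9.0, 1.0, 0.8, 1.1, 0.9]
outliers = two_sigma_outliers(readings)
print(outliers)
```

Because the mean and standard deviation are themselves inflated by the outliers, this rule suits isolated anomalies better than errors that propagate through the series.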
Real power systems also suffer significantly from noise [246], especially after the introduction of powerline communication technologies (PLCs) that support higher data rate transmission (also called high data rate narrowband 3-500 kHz PLC systems). These new technologies are desirable because they can be built on the existing power systems; however, they are designed for one-way communication and not the two-way communication necessary for SG applications [247]. The noise present in these systems affects very high-frequency electrical measurement devices such as PMU devices. The noise in voltage and current measurements of the phases (e.g. in NILM datasets) at 60 or 120 Hz is negligible and can be ignored.
Data quality issues can extend to several other dimensions, namely contextual, representational, and accessibility. Contextual quality covers characteristics of the data that must be present in certain applications but not others. Such qualities are record time (the time it took for the data to be available after it happened in actual time), sampling rate, and quantity of the data. The representational qualities simply refer to how well a dataset follows the format and structure of similar datasets, as well as the interpretability of notations. This issue was found to be a significant hurdle for data analytics [248]. The last dimension is accessibility, in particular availability, which is one of the most prevalent issues in the SG context, as some datasets are more readily available to researchers than others. For example, some datasets require extra procedures such as login credentials and/or licencing.
In light of these issues, we argue that, as a future research direction, more effort should be put into developing toolkits to standardise the datasets. In terms of formatting, for example, the Ausgrid dataset [61] for electricity consumption combines three consumption categories in the same Excel data sheet, while the LCL contains only one. A toolkit with a unified API would make the repeatability of studies much more feasible. Another issue that could be addressed by the toolkits is data preprocessing, since most work on energy consumption utilises similar preprocessing techniques. Feature engineering is another possible extension of such toolkits. For example, a toolkit can facilitate the extraction of time-related features (e.g. peak hours) or apply simple clustering techniques to help with data exploration; clustering daily consumption profiles helps identify common, uncommon, and anomalous consumption habits [249].
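As a sketch of what the time-related feature extraction in such a toolkit might look like, the example below groups hourly readings into daily profiles and reports each day's peak hour and whether it falls on a weekend. The function names and the synthetic data are hypothetical and do not correspond to any existing toolkit; the clustering step is omitted.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta
from typing import Dict, List, Optional


def daily_profiles(timestamps: List[datetime],
                   values_kwh: List[float]) -> Dict[date, List[Optional[float]]]:
    """Group hourly readings into one 24-slot profile per calendar day."""
    profiles: Dict[date, List[Optional[float]]] = defaultdict(lambda: [None] * 24)
    for ts, value in zip(timestamps, values_kwh):
        profiles[ts.date()][ts.hour] = value
    return dict(profiles)


def peak_hour(profile: List[Optional[float]]) -> Optional[int]:
    """Return the hour with the highest observed consumption
    (the earliest such hour on ties), or None for an empty profile."""
    observed = [(hour, v) for hour, v in enumerate(profile) if v is not None]
    if not observed:
        return None
    return max(observed, key=lambda pair: pair[1])[0]


# Two synthetic days of hourly data starting Monday 1 January 2024.
start = datetime(2024, 1, 1, 0)
timestamps = [start + timedelta(hours=h) for h in range(48)]
values = [0.2] * 48
values[18] = 2.0       # evening peak on day one
values[24 + 8] = 1.5   # morning peak on day two

profiles = daily_profiles(timestamps, values)
features = {
    day: {"peak_hour": peak_hour(profile), "is_weekend": day.weekday() >= 5}
    for day, profile in profiles.items()
}
for day, feats in features.items():
    print(day, feats)
```

The resulting 24-slot profiles are also a natural input for the clustering of daily consumption habits mentioned above.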
TABLE 10 Demand shaping and load scheduling categories.

| Categories | Description | References |
| --- | --- | --- |
| Demand shaping: Batteries | A battery (physical or virtual) used for energy consumption can be charged and discharged to obfuscate the fine grain consumption data of the house, thus preserving privacy. | [200, 223–225] |
| Demand shaping: Renewable energy | These techniques obfuscate energy consumption with batteries; however, renewable energy generation must also be modelled. | [223, 226–228] |
| Demand shaping: Heating and cooling | Since cooling and heating have high consumption, scheduling them in a specific way would be able to obfuscate the consumption of smaller appliances and provide more privacy. | [229–231] |
| Load scheduling | Scheduling appliances to make non-intrusive load monitoring more difficult. | [232] |

5.4 | Big data in the smart grid
The key steps to handle and use big data are data acquisition, storage, analysis, and operational integration. The work in ref. [250] reviewed data management for SGs and its technical requirements, the tools, and the necessary steps to integrate big data solutions in the SG context. The authors highlighted three main issues: standards and interoperability; lack of infrastructure to be able to fully utilise the big data; and privacy, integrity, authentication, and security. Furthermore, the authors of ref. [251]

| Detailed in-home consumption datasets
Currently available detailed in-home consumption datasets, commonly referred to as buildings datasets or NILM datasets, fall into two categories: laboratory measurements and data from the actual environment. Available laboratory measurements include data from individual devices, although these data are of very little use for overall benchmark tests because real-world datasets contain measurements where multiple devices are active concurrently. However, assigning reference data in real-world scenarios presents difficulties:
1) The synchronisation of references and measured data; that is, a label should coincide with the pattern shift in the data that it describes. A further requirement is that all data streams must be in sync with one another.
2) The absence or excess of events, and the number of "on" and "off" cycles for each device.
3) The probability distribution of the devices, as well as the lengthy measurement cycles that produce a correspondingly large volume of data containing only a small number of events.
While NILM datasets face trade-offs between covering a large number of houses or focusing on a more extensive set of appliances and measurements, an equally important aspect that is often overlooked in the literature is the preprocessing of data. Ensuring data quality and dealing with missing values is a crucial step in the development of effective energy disaggregation models, as it can significantly improve their performance. Unfortunately, the lack of transparency regarding preprocessing procedures in many studies makes it difficult to replicate results and assess the true impact of any potential bias or domain-specific knowledge that may have been introduced during this stage. By addressing both the trade-offs inherent in dataset design and the need for clear documentation of preprocessing techniques, researchers can work towards developing more robust and generalisable energy disaggregation models.

| Challenges in preprocessing and evaluation
In this subsection, we discuss the challenges and limitations faced in current approaches to data preprocessing, post-processing, model evaluation, and generalisability in the context of electrical grid data analysis.
Most of the literature does not mention data preprocessing steps such as data cleaning and dealing with missing values, despite these being crucial steps known to boost performance. These steps are presumably taken, but not mentioned. Not explicitly stating the preprocessing procedure harms the replicability of the work, as there are several preprocessing procedures that can be followed. The authors could have introduced bias and/or domain knowledge into the data, which may have enhanced the performance of their models.
We have also observed a lack of post-processing techniques, which we believe is a promising direction for future work because it can enhance performance (especially by reducing false positives [254]) and mitigate common biases, especially in energy disaggregation. For example, the authors in ref. [255] discovered that disaggregation techniques typically overestimate or underestimate disaggregated loads and proposed a technique that ensures that the disaggregated loads sum up to approximately the true aggregate consumption. Similarly, the authors of ref. [256] discovered bias when dealing with multi-state appliances (e.g. dishwashers and washing machines). Models typically produce several sporadic activations for such appliances.
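The two biases above can be illustrated with very simple post-processing rules. The sketch below is not the method of ref. [255] or ref. [256]; it is a stand-in that assumes every load is sub-metered (so estimates can be scaled to match the aggregate exactly) and that activations are given as 0/1 sequences. The function names and the `min_len` parameter are hypothetical.

```python
def rescale_to_aggregate(estimates, aggregate):
    """Scale per-appliance estimates at each time step so they sum to the aggregate.

    Time steps where all estimates are zero are left at zero.
    """
    names = list(estimates)
    scaled = {name: [] for name in names}
    for t, total in enumerate(aggregate):
        est_sum = sum(estimates[name][t] for name in names)
        factor = total / est_sum if est_sum > 0 else 0.0
        for name in names:
            scaled[name].append(estimates[name][t] * factor)
    return scaled

def drop_short_activations(on_off, min_len=3):
    """Set runs of 1s shorter than min_len samples to 0 to suppress sporadic activations."""
    out = list(on_off)
    i = 0
    while i < len(out):
        if out[i] == 1:
            j = i
            while j < len(out) and out[j] == 1:
                j += 1
            if j - i < min_len:
                out[i:j] = [0] * (j - i)
            i = j
        else:
            i += 1
    return out

# Synthetic estimates in watts (assumption): the model overestimates at t=0.
estimates = {"fridge": [100.0, 120.0], "washer": [300.0, 0.0]}
aggregate = [200.0, 120.0]
scaled = rescale_to_aggregate(estimates, aggregate)

# Synthetic on/off predictions with two spurious short activations.
smoothed = drop_short_activations([0, 1, 0, 1, 1, 1, 1, 0, 1, 1], min_len=3)
```

In real datasets the aggregate usually includes unmetered loads, so an exact match is too strong a constraint; a practical variant would only cap the sum of estimates at the aggregate.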
Another preprocessing issue observed in the literature is the arbitrary exclusion of some data without justification, which threatens the validity of the models. For example, some houses in the REDD dataset include very few events. These houses were mostly excluded due to the effect they have on training. The issue is not specific to the REDD dataset, as each model has its own setbacks that can be revealed if tested on more houses. To this end, we recommend using techniques such as leave-one-house-out cross-validation for a more complete evaluation in future work. Different authors also select which appliances, and how many, they will train and test on without justification.
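Leave-one-house-out cross-validation can be sketched in a few lines. The example assumes each sample carries the identifier of the house it came from; the function name `leave_one_house_out` and the house identifiers are hypothetical.

```python
def leave_one_house_out(house_ids):
    """Yield (held_out_house, train_indices, test_indices), one fold per house."""
    for held_out in sorted(set(house_ids)):
        train = [i for i, h in enumerate(house_ids) if h != held_out]
        test = [i for i, h in enumerate(house_ids) if h == held_out]
        yield held_out, train, test

# Synthetic example (assumption): six samples drawn from three houses.
house_ids = [1, 1, 2, 3, 3, 3]
folds = list(leave_one_house_out(house_ids))
```

Because every house serves as the test set exactly once, houses with few events are evaluated rather than silently dropped, and the per-fold scores show how much performance varies between houses.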
There is no clear justification and/or consensus for the selection of the training and testing split. Some train their model for 5 days and test only on one, while others follow a different evaluation strategy. This makes it difficult to compare and evaluate models, not to mention that models will be more likely to overfit the test data, performing better on it but generalising worse. Some models also train and test on the same house, while others train on one house and test on another, which means the former have lower generalisability.
The authors also define steady states and transient states differently. For example, a steady-state power signal must not fluctuate by more than a certain threshold and must last for a period of time. In probabilistic models, such assumptions extend to the appliances (average, maximum, minimum, and duration of power consumed). While necessary, this poses the following trade-off: a stricter definition (i.e. a higher threshold) will eliminate noise; however, it may fail to detect the consumption of small appliances (their consumption will still be considered part of the steady state). To better illustrate this point, imagine a small appliance that consumes 10 W: if the steady-state threshold were, for example, 20 W, then its consumption would be considered noise and would not be detected. A more lenient definition (i.e. a lower threshold) will allow small appliances to be detected; however, because less noise is filtered out, poorer performance becomes inevitable. In probabilistic models, a more "diverse" assumption on the appliances (e.g. picking appliances that have a high difference in their average consumption) will allow for better distinction between the appliances and better performance overall. However, this will require handpicking of appliances and is thus not practical.
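The threshold trade-off can be reproduced with a deliberately simplified edge detector. The sketch below only checks the magnitude of step changes between consecutive samples and ignores the minimum-duration requirement of a full steady-state definition; the function name `detect_events` and the synthetic signal are assumptions.

```python
def detect_events(power, threshold):
    """Return (index, delta) for each step change whose magnitude exceeds threshold."""
    events = []
    for t in range(1, len(power)):
        delta = power[t] - power[t - 1]
        if abs(delta) > threshold:
            events.append((t, delta))
    return events

# Synthetic aggregate signal in watts (assumption): 6 W of jitter at the start,
# a 10 W load switching on and off, then a 2000 W load switching on and off.
signal = [50, 56, 50, 60, 60, 50, 2050, 2050, 50]

strict = detect_events(signal, threshold=20)   # misses the 10 W load
lenient = detect_events(signal, threshold=5)   # finds the 10 W load, but also the jitter
```

With the 20 W threshold only the 2000 W load is reported, while the 5 W threshold recovers the 10 W load at the cost of two spurious events caused by jitter.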
We believe that more attention must be paid to developing models with generalisability and transferability in mind [257]. This can be evaluated by training and testing on different datasets or on different houses. Comparing the same model across different datasets poses several challenges. The first is the different percentage of missing data in the datasets; some loads are not sub-metered, so their consumption data are missing. The second is the scarcity of fully labelled NILM datasets. The last challenge is the different characteristics of the datasets, such as the type and sampling rate of the measurements, and the different formats.
The learned models are affected by the sampling rate of their associated dataset. Data preprocessing techniques that can capture most of the features at lower sampling rates while still maintaining high performance are a promising future research direction.
Another notable issue is associated with the use of metrics that favour classifying high-power consumption devices. Such metrics do not capture information about how well the model performs on low-power devices. It is argued, however, that such information is valuable since low-power appliances are typically what the user has the greatest control over.
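A small numerical example shows how an energy-pooled metric can hide a complete failure on a low-power device. The two metrics below are generic illustrations rather than metrics taken from a specific study, and the appliance names and energy values are synthetic assumptions.

```python
def pooled_relative_error(true, pred):
    """Total absolute error divided by total true energy across all appliances."""
    total_err = sum(abs(pred[a] - true[a]) for a in true)
    return total_err / sum(true.values())

def per_appliance_relative_error(true, pred):
    """Absolute error of each appliance divided by its own true energy."""
    return {a: abs(pred[a] - true[a]) / true[a] for a in true}

# Synthetic daily energy in Wh (assumption): the model is 5% off on the oven
# and misses the lamp entirely.
true_wh = {"oven": 2000.0, "lamp": 10.0}
pred_wh = {"oven": 1900.0, "lamp": 0.0}

pooled = pooled_relative_error(true_wh, pred_wh)
per_app = per_appliance_relative_error(true_wh, pred_wh)
macro = sum(per_app.values()) / len(per_app)
```

The pooled error is about 5.5% even though the lamp is never detected, whereas the macro-averaged per-appliance error is 52.5%, which makes the failure on the low-power device visible.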

| FUTURE WORK
This section provides future research directions highlighting key areas that require further exploration and development in the field of SG data analytics.
• Synthetic Data Generation: Future research can focus on developing privacy-preserving synthetic data generation techniques for SG, particularly for market and grid data. Another potential avenue is investigating advanced synthetic data generation methods, such as the use of GANs, while addressing privacy concerns like membership inference attacks.
• Advancing Privacy Preservation and Security: Future research can focus on exploring the impact of privacy preservation techniques, particularly demand shaping and load scheduling, on SG data utility and energy management. Another avenue is to develop methods to identify consumers who use privacy-preserving techniques, because these consumers may affect utilities' data analytics.
• Improving Data Quality and Standardisation: Future research may address SG data quality issues by creating comprehensive toolkits for data standardisation and preprocessing, including feature engineering and clustering. A key focus may be on unifying dataset formats and structures for better data exploration and better analytics accuracy.
• Big Data Management and Analytics in SG: Future research may investigate the integration of big data solutions in SG, addressing challenges in data management, standards, interoperability, and infrastructure development, with an emphasis on improving data acquisition, storage, analysis, and operational integration.
• Detailed In-Home Consumption Datasets and Preprocessing Techniques: Future research may aim at improving in-home consumption datasets by refining preprocessing methods for handling data synchronisation, event detection, and large data volumes. Standardising preprocessing steps is crucial for enhancing study replicability and minimising biases. This effort includes better strategies for data cleaning and handling missing values. There is also a need for more inclusive datasets covering diverse appliances and conditions to foster robust, generalisable energy disaggregation models.
By addressing these areas, future research can significantly contribute to the advancement of SG data analytics, ensuring more efficient, secure, and reliable data management systems.

| CONCLUSION
Power grids generate huge volumes of data, particularly in the SG context, where various types of data originate from several sources, typically at higher sampling rates. In addition to enabling safe operation of the grid itself, such data enable a wide variety of applications. Despite their high utility, the availability of public real-world smart grid datasets is very limited. In this work, we reviewed over 50 public datasets in the smart grid context, organising them into three main categories: consumers' data, NILM data, and grid data. Each category enables a distinct set of applications. After considering the characteristics of the individual datasets, 14 of their most popular applications were discussed, as well as numerous other less popular applications. Several findings are discussed and highlighted throughout this contribution. In the end, we present a discussion of some prevalent issues that motivate potential future research and development directions. Ultimately, this review provides a comprehensive survey of public datasets in smart and power grid research, with the aim of improving reproducibility and serving as a key reference for researchers developing applications in this domain.
Comparison of current work with existing survey papers.
TABLE 3
FIGURE 1 Road map of the paper.
ALTAMIMI ET AL.
63] data points. The Ausgrid Distribution Network also provides monthly electrical data. Data are provided for the period from 1 January 2007 to 31 December 2014 and, as a result, it includes periods of household electricity consumption prior to the installation of the solar system. Furthermore, a data set of 4064 non-solar homes is provided for the same time period to compare electricity consumption patterns between the two datasets.
• Ausgrid substation data: Since 2005, Ausgrid has provided public access to the load profiles of approximately 180 zone substations through their website, with regular updates that ensure the data set remains current. Each entry in the dataset contains the year, zone substation name, date, and corresponding data unit, followed by a full day's worth of measurements at 15-min intervals.
• Past outages: Power supply interruptions that affect 50 or more customers and last for more than 5 min are recorded in the database and published quarterly. The dataset contains information on the start time, average duration of the outage in minutes, number of consumers affected, and its potential cause. The data are organised by LGA as is done for the electricity consumption subset.
TABLE 5 Data-sets characteristics.
25152947, 0, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/stg2.12161 by Qatar University, Wiley Online Library on [17/07/2024]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
a The dataset only includes system-level data.
current trajectories, electrical noise, and (power, reactive power, and distortion power) trajectories. Although low sampling frequencies can be used to achieve some NILM applications, transient analysis cannot be performed, limiting overall performance and the range of applications that can be used. Voltage, current, and power variables are the features that are most important in low-sampling-frequency NILM datasets, with reactive power being the distinctive feature that is most frequently used in research.
[253]ssed several challenges in the area of big data analytics, including data indexing and time synchronisation. The two broad categories of applications in big data are smart metre big data and PMU big data. Smart metre big data applications are related to energy management, such as load forecasting, profiling, DR, and baseline estimation. The CBT dataset is a common public dataset used for this area of research due to its large volume (167 million data rows) [252]. PMU big data are used for state estimation, transmission grid visualisation, and SG reliability and stability. Simulations are commonly used to generate PMU data [253].