Data processing and quality verification for improved photovoltaic performance and reliability analytics

Data integrity is crucial for the performance and reliability analysis of photovoltaic (PV) systems, since actual in‐field measurements commonly exhibit invalid data caused by outages and component failures. The scope of this paper is to present a complete methodology for PV data processing and quality verification in order to ensure improved PV performance and reliability analyses. Data quality routines (DQRs) were developed to ensure data fidelity by detecting and reconstructing invalid data through a sequence of filtering stages and inference techniques. The obtained results verified that PV performance and reliability analyses are sensitive to the fidelity of data and, therefore, time series reconstruction should be handled appropriately. To mitigate the bias effects of 10% or less invalid data, the listwise deletion technique provided accurate results for performance analytics (exhibited a maximum absolute percentage error of 0.92%). When missing data rates exceed 10%, data inference techniques yield more accurate results. The evaluation of missing power measurements demonstrated that time series reconstruction by applying the Sandia PV Array Performance Model yielded the lowest error among the investigated data inference techniques for PV performance analysis, with an absolute percentage error less than 0.71%, even at 40% missing data rate levels. The verification of the routines was performed on historical datasets from two different locations (desert and steppe climates). The proposed methodology provides a set of standardized analytical procedures to ensure the validity of performance and reliability evaluations that are performed over the lifetime of PV systems.


| INTRODUCTION
High-quality data are of utmost importance for monitoring and facilitating advanced performance analytics of photovoltaic (PV) systems. 1 For the rapidly evolving PV industry, the benefits of improving operation and maintenance (O&M) practices through data-driven monitoring approaches are evident. In this sense, the quality and validity of the acquired data, coupled with the underlying data-driven performance and reliability analytics, are prerequisites for maintaining optimal performance over the lifetime of a system.
With respect to data integrity, invalid data (i.e., missing and outlying values), caused by power outages, equipment/component faults, communication failures, or interruption for maintenance reasons, are a commonly exhibited problem in PV monitoring systems. The processes and techniques applied to mitigate invalid data can potentially introduce noticeable bias that obscures underlying PV performance and reliability analyses. To this end, invalid datasets have to be detected and processed with appropriate mitigation tools, before commencing with data analytics.
Even though data quality constitutes the foundational block for performance and reliability analytics, only few reference guidelines and reports address issues that focus on data processing and quality control tools. Existing guidelines and reports are mainly limited to the requirements of monitoring systems with respect to data acquisition, outlier detection, and data processing for performance assessment of PV plants. [2][3][4][5][6][7][8][9] In particular, several preprocessing data quality checks that include invalid data detection and filtering are outlined in the International Electrotechnical Commission (IEC) 61724 standard. [2][3][4] The standard recommends the application of a first-stage filter to ensure the presence of data during daylight hours (in-plane irradiance ≥ 20 W/m 2 ) followed by identification of gaps, duplicates, missing, and erroneous data points, which are also filtered out. The recommended methods for identifying invalid measurements include the application of threshold ranges (minimum and maximum parameter bounds), limits on the maximum rate of change between successive data points, statistical methods (not defined), comparisons among different sensors (if available), and clear-sky models to identify outliers. Furthermore, error codes signaled by sensors and data acquisition devices (DAQ) are recorded, and the timestamps are checked to identify gaps or duplicates in a given dataset. The identified invalid data may be discarded or treated by replacement with modeled or estimated values (from the valid data points recorded before or after the missing time step) or with averaged values (from the available data at that time period) in case of partial unavailability. The main disadvantage of IEC 61724 lies in its qualitative description, which does not offer a case-specific approach that would enable reproducible and unbiased results.
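The IEC 61724-style daylight and threshold checks described above can be sketched in a few lines of pandas. This is an illustrative sketch, not the standard's reference implementation; the column names and the power rating used here are hypothetical:

```python
import numpy as np
import pandas as pd

def iec61724_basic_checks(df, g_col="G_I", p_col="P_out", p_rating=1025.0):
    """Sketch of IEC 61724-style pre-processing (illustrative only):
    drop duplicate timestamps, keep daylight records (G_I >= 20 W/m2),
    and replace out-of-range values with NA."""
    df = df[~df.index.duplicated(keep="first")]   # duplicate timestamps
    df = df[df[g_col] >= 20.0].copy()             # daylight filter
    bad_g = (df[g_col] < 0.0) | (df[g_col] > 1300.0)          # physical bounds
    bad_p = (df[p_col] < 0.0) | (df[p_col] > 1.02 * p_rating)
    df.loc[bad_g, g_col] = np.nan                 # invalid values -> NA
    df.loc[bad_p, p_col] = np.nan
    return df

idx = pd.date_range("2020-06-01 04:00", periods=6, freq="1h")
raw = pd.DataFrame({"G_I": [5.0, 150.0, 400.0, 1500.0, 800.0, 600.0],
                    "P_out": [0.0, 120.0, 330.0, 900.0, 700.0, -10.0]},
                   index=idx)
clean = iec61724_basic_checks(raw)
```

Here the pre-dawn record is dropped by the daylight filter, while the out-of-range irradiance and the negative power value are flagged as NA rather than deleted, so that subsequent routines can treat them explicitly.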
Similarly, the European Joint Research Centre (JRC) guidelines recommend that all processed data are checked for consistency and gaps in order to identify data anomalies. 5,6 Reasonable ranges are set for each recorded parameter, and data points that fall outside these ranges or are otherwise inconsistent are filtered out. Other metrics such as the total time of monitoring activity and outage fraction are recommended; however, these guidelines also fail to provide a universal and quantitative approach for data quality.
A technical report from the National Renewable Energy Laboratory (NREL) highlighted the importance and challenges of obtaining high-quality data through periodic data quality checks. 7 The proposed data quality assurance checks include the identification of missing and erroneous values, inconsistencies in the frequency of data collection, filtering of nighttime measurements, identification of duplicate records, detection of underperformance (by comparing outputs of similar subarrays), detection of outlying and poor data from equipment malfunction (based on nearby sensor data or clear-sky models), and treatment of missing values (e.g., with averaged values or modeled data). Even though the data processing procedure holistically provides information on how to detect invalid data, it does not provide details on treating the identified invalid datasets, nor does it explicitly define how to handle missing data.
An open-source tool for PV monitoring (Pecos) was developed by Sandia National Laboratories (SNL). 8 This tool is designed to perform quality control checks on time series datasets in order to identify a wide range of anomalous conditions within a dataset. It leverages an initial time filter used to eliminate data points that fall outside specific time intervals (e.g., a time filter between 3 a.m. and 9 p.m.) and subsequently applies quality control tests to diagnose missing data points by searching for blank (empty) cells and Not a Number ('NaN') entries. 10 This library has been validated using raw data from existing PV systems; however, similar to the aforementioned reports, it does not offer specific suggestions on how to detect invalid data from PV systems. Nonetheless, it could be used to implement such guidance.
A quality control routine for detecting invalid power measurements was presented by Killinger et al. 9 The algorithm identifies invalid power output data by setting physical limits, comparing measurements against sky models and indexes (e.g., extraterrestrial irradiance, clear-sky index, and PV power models) and also through the application of system statistics (e.g., variability check of measured power). This quality control routine is focused on the detection part without any information on how to handle the detected invalid data points.
Although the PV-related literature on handling invalid data is limited, [11][12][13][14] studies from other disciplines (e.g., mathematics and computer science) can provide insights into how to deal with data anomalies. More specifically, outliers (which are replaced by NA values) and missing values are handled based on the missing data pattern (also called the missing data mechanism). A study by Batista and Monard 15 identified three main types of missing data patterns: missing completely at random (MCAR), missing at random (MAR), and not missing at random (NMAR). Missing data classified as MCAR occur when there is no specific mechanism of missingness, while MAR data occur when other variables affect the existence of missing values and the likelihood of having a missing value is independent of the value itself. 14 On the contrary, if the likelihood of having a missing value is associated with the missing value itself, the exhibited missing data pattern is NMAR. 14 Identifying the type of the exhibited missing data pattern is important as it determines which treatment method is appropriate. This is a challenging task because the application of data deletion and inference techniques is strongly dependent on the missing data pattern and requires careful examination of the dataset in order to avoid the introduction of bias.
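The practical consequence of the missing data mechanism can be illustrated with a small simulation on synthetic data (an illustrative sketch): under MCAR, the surviving sample remains representative, whereas under NMAR, where the probability of loss depends on the value itself, the surviving sample is systematically biased.

```python
import numpy as np

rng = np.random.default_rng(0)
power = rng.uniform(0.0, 1000.0, size=10_000)   # synthetic power readings (W)

# MCAR: every record has the same 20% chance of being lost,
# so the surviving sample stays representative.
mcar_mask = rng.random(power.size) < 0.20

# NMAR: the probability of loss depends on the (unobserved) value itself,
# e.g., a logger that drops high-power readings; the survivors are biased.
nmar_mask = power > np.quantile(power, 0.80)

full_mean = power.mean()
mcar_mean = power[~mcar_mask].mean()
nmar_mean = power[~nmar_mask].mean()
```

The MCAR sample mean stays close to the full-sample mean, while the NMAR sample mean is pulled noticeably downward, which is exactly the kind of bias that propagates into performance metrics if the mechanism is ignored.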
Existing data quality assurance guidelines analyze either the available valid measurements, excluding the invalid data/periods (by applying either listwise or pairwise deletion) or replace the missing data with modeled or estimated values. 16 In the case of listwise deletion, all rows with at least one missing data point are excluded from the analysis. 17 In pairwise deletion, only the missing values are removed. As such, bias may be introduced in the analysis depending on the missingness rate, deletion method, and so on causing false performance alarms and unnecessary maintenance activities. 11 In order to correct for this bias, missing data could be inferred (either imputed or estimated by a model); however, there is no PV-related investigation available to support this hypothesis. For the MCAR case, missing values can be ignored or inferred without knowing the reason the data are missing. In contrast, the missingness pattern must be thoroughly evaluated for the MAR and NMAR cases, before determining the most appropriate data quality routine.
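The two deletion strategies can be sketched with pandas (an illustrative example; the small DataFrame is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "G_I": [800.0, np.nan, 650.0, 900.0],   # in-plane irradiance (W/m2)
    "P_A": [170.0, 120.0, np.nan, 195.0],   # array DC power (W)
})

# Listwise deletion: drop every row with at least one NA, so all
# columns are evaluated over the same set of timestamps.
listwise = df.dropna(how="any")

# Pairwise deletion: each column keeps all of its own valid values,
# so statistics may be computed over different timestamps per column.
pairwise_means = df.mean(skipna=True)
```

Note that under pairwise deletion the irradiance and power means are computed over different subsets of timestamps, which is one way the bias discussed above can enter ratio metrics such as the PR.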
A previous study demonstrated that missing data rates (defined as the ratio of missing values to the total number of data points) of less than 1% pose negligible impact on performance metrics, whereas missing data rates over 5% necessitate data inference techniques to yield accurate analytical results. 15 Different data inference techniques have been proposed based on statistical procedures, parametric models, empirical models, and machine learning approaches. 11,12,15,18,19 Numerous studies propose data imputation with simulated data, 11,20 mean or median imputation, 21 optimally weighted average imputation, 22 multivariate imputation by chained equations (MICE), 23 linear interpolation (LI), 14 k-nearest neighbors (k-NN) imputation, 15 last observation carried forward (LOCF), seasonal decomposition (SD), 24 bootstrapping, 21 and random forest (RF). 23 Dataset reconstruction is important for ensuring that the significant features of the time series are preserved and not lost due to reductions in the dimensionality.
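Two of the simpler inference techniques listed above, linear interpolation (LI) and last observation carried forward (LOCF), can be sketched with pandas (an illustrative sketch, not any specific study's implementation):

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2020-06-01 08:00", periods=5, freq="1h")
power = pd.Series([100.0, np.nan, np.nan, 250.0, 300.0], index=idx)

li = power.interpolate(method="linear")  # linear interpolation (LI)
locf = power.ffill()                     # last observation carried forward (LOCF)
```

LI fills the two-hour gap with values on the straight line between the bounding observations (150 and 200), while LOCF repeats the last valid reading (100), which illustrates why the choice of technique matters for ramping quantities such as PV power.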
A standardized and/or universally applicable mechanism of data quality control for PV performance and reliability analyses is not available. Hence, a complete and quantitative PV data processing methodology is proposed in this work for bridging the qualitative-quantitative gap that exists in current practices. The proposed methodology builds on quantifiable criteria from IEC 61724 and other PV data quality reports and minimizes existing process gaps that are presented in an ambiguous and/or qualitative manner. Such gaps can be interpreted differently depending on the PV performance analyst and are one of the main sources of bias and inconsistency.
Therefore, data quality routines (DQRs) that operate on measurements were developed, and each step of the methodology is described in a quantitative manner based on detailed analyses and not arbitrary assumptions. The aim is that the DQRs will become (or contribute to) an open-source library enabling the analysis of bulk PV data and, hence, benefitting the PV industry and research community. Interoperability with other packages (such as PVLIB 25 and RdTools 26 ) will also be investigated. The paper is organized in a way to present the complete methodology in the form of a table, where each quantifiable or decision-making step is justified. Finally, data visualization steps are included in the Appendix for completeness.

| METHODOLOGY
The methodology (Figure 1) builds on quantifiable criteria/steps from the IEC 61724 standard 2-4 and other PV data quality reports. [5][6][7][8][9] It is a pipeline of sequentially structured DQRs that include the application of initial statistics, consistency examination, filtering, detection of invalid values and data rates, treatment of invalid data, and aggregation at different granularities. The methodology was evaluated against a set of research questions, including whether it is system- and location-independent.
In order to answer these questions, reference datasets were constructed using module and system measurements from two different locations. Both averaged and instantaneous measurements were utilized, and PV performance and reliability metrics were extracted.
Artificially 'invalid' datasets were also generated by introducing missing data at different rates and sequences to enable a comparative analysis. The DQRs were then applied by detecting and treating the invalid datasets using different methods of deletion and inference.
Each step of this parametric analysis was compared against the reference values in order to optimize the DQRs methodology.

| Experimental apparatus
The developed DQRs were validated against data of different sampling and geographical locations. The field measurements were acquired from a well-maintained test PV module installed at the outdoor test facility (OTF) of Gantner Instruments (GI) in Arizona, US (Köppen-Geiger-Photovoltaic climate classification BK; desert climate with very high irradiation) 27 and a test PV system at the OTF of the University of Cyprus (UCY) in Nicosia, Cyprus (Köppen-Geiger-Photovoltaic climate classification CH; steppe climate with high irradiation) 27 ; system availability was higher than 98%. The polycrystalline silicon (poly-c-Si) PV module was installed in an open-field mounting structure at the GI OTF, and it is rated at 220 W p nominal power. The test PV system at the UCY OTF includes five poly-c-Si PV modules (rated at 205 W p each) connected in series to form a string with a nominal power capacity of 1.025 kW p at the input of a string inverter. 28 The PV system is installed in an open-field mounting arrangement.
The electrical performance of the test PV module and system, along with the prevailing irradiance and environmental conditions, was recorded according to the requirements set by the IEC 61724 standard and stored using a measurement monitoring platform. 2 The monitoring systems at both locations include solar irradiance (pyranometers), wind (anemometer and wind vane), temperature (thermocouples), and electrical (current shunts and transducers, voltage dividers and transducers) sensors connected to a central DAQ system that stores data every second. Both outdoor test sites collect in-plane irradiance (G I ), ambient air temperature (T amb ), wind speed (W s ), and wind direction (W a ). The PV measurements include module temperature (T mod ), array current (I A ), and voltage (V A ), which are multiplied together to calculate the DC power (P A ), as well as the AC output power (P out ). Additional yields and performance metrics, such as the final PV system yield (Y f ), the reference yield (Y r ), the monthly performance ratio (PR), and the monthly temperature-corrected performance ratio (PR TC ), were also calculated. 2
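The yield-based metrics above can be sketched as follows (a minimal illustration; the function name and the example numbers are hypothetical):

```python
def performance_ratio(energy_kwh, p0_kwp, insolation_kwh_m2, g_stc_kw_m2=1.0):
    """PR = Y_f / Y_r: final yield (E_out / P0) over reference yield (H_i / G_STC)."""
    y_f = energy_kwh / p0_kwp              # final yield, kWh/kWp
    y_r = insolation_kwh_m2 / g_stc_kw_m2  # reference yield, h
    return y_f / y_r

# Hypothetical month for a 1.025 kWp string: 170 kWh produced,
# 200 kWh/m2 of in-plane insolation.
pr = performance_ratio(170.0, 1.025, 200.0)
```

Because both yields are normalized (by the rated power and by STC irradiance, respectively), the resulting PR is dimensionless, which is what makes it comparable across systems of different sizes and locations.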

| Reference datasets
In order to demonstrate the effectiveness of the data processing and quality verification methodology, three reference datasets were generated to serve as a baseline for comparison and benchmarking with respect to (1) location-independence, (2) system-independence, and (3) sampling-independence.

| Artificially invalid datasets
In an attempt to examine the impact of missing data points (indicated with NA) on PV performance and reliability analyses, different invalid datasets were generated by inserting artificial missing data points in the reference datasets.
F I G U R E 1 Flowchart of data processing and quality verification methodology

More specifically, the artificially invalid datasets were generated by first selecting a missing data rate from 1% to 40% in whole number increments. Then, data records from the reference datasets were randomly selected and replaced with NA until the target missing data rate was reached. The selected data points were designated as MCAR in order to create artificial missing periods 29 and then apply any data treatment method to the missing data without the risk of introducing bias. 14,15 This process was repeated 50 times for each missing data rate (1%, 2%, …, 40%), resulting in 2,000 invalid datasets per reference dataset. The record selection was performed in two specific ways:
• Random: missing data were randomly added by iteratively sampling a random number from 1 to 4,100 (daylight hours in a year) and assigning that hour as missing (NA) until the target missing data rate was reached.
• Continuous: this method for generating missing data assumed that the missing data were due to a sensor or system outage that resulted in continuous and consecutive missing data. To generate realizations for this case, a single random number from 1 to 4,100 was sampled and assumed to be the start time of the outage. For each target missing data rate (from 1% to 40% in whole numbers), the corresponding number of hours after the start time was assigned as missing (NA). For example, in the case of a 10% missing data rate, 410 hours would be marked as NA beginning at the random start time. A constraint was added to ensure that the start time is early enough for the entire missing period to fit within the year.
A structured sampling method was used to ensure that random and continuous missing data points were evenly distributed across different months in order to capture seasonality. It should be noted that current O&M best practices in Europe restrict the continuous missing data rate to <10% for electrical and irradiance data. 7 When the rate is higher, the whole period is discarded. 7 However, this might not be the case worldwide, and this analysis therefore extends up to 40% missing data rates for completeness.
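The random and continuous record-selection schemes can be sketched as follows (an illustrative reimplementation, not the authors' code; `make_invalid` is a hypothetical helper):

```python
import numpy as np

def make_invalid(values, rate, mode, rng):
    """Insert NaNs into a copy of `values` at a target missing data rate.

    mode='random': MCAR-style selection of individual hours.
    mode='continuous': a single consecutive outage of the required length.
    (Hypothetical helper illustrating the two selection schemes.)
    """
    out = np.array(values, dtype=float)
    n_missing = round(rate * out.size)
    if mode == "random":
        idx = rng.choice(out.size, size=n_missing, replace=False)
    else:
        # Constrain the start so the whole outage fits inside the series.
        start = int(rng.integers(0, out.size - n_missing + 1))
        idx = np.arange(start, start + n_missing)
    out[idx] = np.nan
    return out

rng = np.random.default_rng(42)
reference = np.ones(4100)                     # one year of daylight hours
rand10 = make_invalid(reference, 0.10, "random", rng)
cont10 = make_invalid(reference, 0.10, "continuous", rng)
```

Both realizations reach the same 10% missing rate (410 of 4,100 hours), but the continuous variant concentrates the gap in one block, which is why the two scenarios stress the treatment methods differently.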
During the investigation, different scenarios of invalid data points were considered, reflecting real PV system monitoring test cases of data loss:
• Test Case 1: missing power measurements, introduced (a) randomly and (b) continuously;
• Test Case 2: missing in-plane irradiance measurements, introduced (a) randomly and (b) continuously;
• Test Case 3: missing module temperature measurements (affecting the PR TC calculation), introduced (a) randomly and (b) continuously.

| Data handling techniques
The invalid datasets were analyzed by (a) discarding the missing measurements and analyzing only the available measurements (pairwise deletion method) and (b) discarding the periods with invalid measurements and analyzing only the periods with all data values available (listwise deletion method). To mitigate the effects of missing measurements, the invalid datasets were also treated using data inference techniques that back-fill the missing measurements with estimated values from statistical or empirical models. In this work, the invalid data points were treated with two univariate imputation methods (RF and bootstrap) and multiple imputation by predictive mean matching (PMM). 21,23 When additional recorded measurements were available (i.e., electrical and/or meteorological measurements), statistical approaches based on multiple imputation and empirical models were applied for the inference process. For this purpose, imputation by PMM was employed. For each missing value, the PMM method creates a small set of candidate donors from the complete dataset that have predicted values closest to the predicted value of the missing data point. 30 In parallel, a simplified version of the Sandia PV Array Performance Model (SAPM) was also used to predict the power output of the PV modules/systems 31 :
P_A = (G_I / G_STC) · P_STC · [1 + k_1 · ln(G') + k_2 · ln²(G') + ΔT · (k_3 + k_4 · ln(G') + k_5 · ln²(G')) + k_6 · ΔT²]

with G' = G_I / G_STC and ΔT = T_mod − T_STC, where G_STC, P_STC, and T_STC are the standard test conditions (STCs) irradiance, power, and temperature, respectively, and k_1 − k_6 are empirical fit coefficients. In order to define the best set of empirical coefficients and capture the local climatic conditions, at least 40-50 days of data are required for the training process. 34 Similarly, module temperature was calculated from the in-plane irradiance and ambient temperature using the Ross thermal model 35 :

T_mod = T_amb + k · G_I

where k is the Ross coefficient, extracted from the graphical representation of (T_mod − T_amb) against G_I using the valid available measurements.
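The extraction of the Ross coefficient from valid measurements can be sketched as a least-squares fit of the module-ambient temperature difference against in-plane irradiance through the origin (an illustrative sketch on synthetic data):

```python
import numpy as np

def ross_coefficient(g_i, t_mod, t_amb):
    """Least-squares estimate (through the origin) of k in T_mod = T_amb + k * G_I."""
    g = np.asarray(g_i, dtype=float)
    dt = np.asarray(t_mod, dtype=float) - np.asarray(t_amb, dtype=float)
    return float(np.sum(g * dt) / np.sum(g * g))

# Synthetic check: measurements generated with a known k of 0.03 C per W/m2.
rng = np.random.default_rng(1)
g = rng.uniform(100.0, 1000.0, 500)
t_amb = rng.uniform(10.0, 35.0, 500)
t_mod = t_amb + 0.03 * g + rng.normal(0.0, 0.5, 500)
k = ross_coefficient(g, t_mod, t_amb)
```

Fitting through the origin mirrors the Ross model's assumption that the temperature rise vanishes at zero irradiance; in practice the fit should use only valid, quality-checked measurements, as the text notes.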
For installations that include wind speed measurements, it is possible to account for the influence of wind on the module temperature by using the Sandia module temperature model (SMTM) 31 :

T_mod = G_I · e^(a + b · W_s) + T_amb

where a and b are empirical coefficients that establish the upper limit for module temperature at low wind speeds and high solar irradiance and account for forced convection by wind, respectively.
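A minimal sketch of the SMTM follows, assuming the published open-rack glass/cell/polymer-sheet coefficients as defaults (an assumption made here purely for illustration; site-specific coefficients should be fitted in practice):

```python
import math

def sandia_module_temperature(g_i, t_amb, wind_speed, a=-3.56, b=-0.075):
    """Sandia module temperature model: T_mod = G_I * exp(a + b * W_s) + T_amb.

    The default a, b are the published coefficients for a glass/cell/polymer-sheet
    module on an open rack (assumed here for illustration)."""
    return g_i * math.exp(a + b * wind_speed) + t_amb

# 1000 W/m2, 25 C ambient, 1 m/s wind -> roughly 50 C module temperature.
t = sandia_module_temperature(g_i=1000.0, t_amb=25.0, wind_speed=1.0)
```

Increasing `wind_speed` makes the exponent more negative and pulls the module temperature toward ambient, which is the forced-convection behavior the b coefficient encodes.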
Finally, for locations under transient climatic conditions, module temperature can be calculated using the weighted-moving-average temperature model 36 :

T_modA = Σ_i [w(t_i) · T_SS,i] / Σ_i w(t_i)

where i is the index over a number of prior timesteps, t_i is the number of seconds in the past for each timestep, T_SS,i is the steady-state temperature prediction at t_i seconds in the past (°C), w(t_i) is a weighting function that decays with t_i, and T_modA is the moving-average model temperature prediction for the current timestep (°C).

| PV performance and reliability metrics
The PR was selected as the performance metric in this investigation because it is a normalized parameter and a key performance indicator (KPI), typically used to characterize PV plant performance for acceptance and operations testing. 37 The reliability of the PV modules was evaluated based on the performance loss rate (PLR), which can either be linear or nonlinear. [38][39][40][41] In this analysis, a constant PLR over time was assumed and estimated by applying linear regression with ordinary least squares (OLS) on the 5-year reference dataset of the test PV system in Cyprus. 42 The absolute PLR was calculated as follows:

PLR = a · t

where a is the slope of the OLS fit and t is a conversion factor between the timestamp and years (e.g., 12 or 365 for monthly or daily aggregation, respectively). For a reliable evaluation of the PLR, at least a 5-year PR time series should be available to yield credible results that are not influenced by seasonal performance variations. 43,44 In order to compare the PLR obtained using the reference dataset against the PLR values obtained from the artificially invalid datasets (constructed in Section 2.3), the absolute percentage error (APE) was used:

APE = |A_t − P_t| / A_t × 100%

where A_t is the actual value and P_t is the predicted value. For this analysis, A_t was set as the average monthly PR (or the reference PLR for reliability analysis) from the reference dataset with no missing data, while P_t was set as the average monthly PR of the 2,000 invalid datasets.
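The PLR and APE computations can be sketched as follows (illustrative only; `absolute_plr` and the synthetic PR series are hypothetical):

```python
import numpy as np

def absolute_plr(pr_monthly):
    """Absolute performance loss rate from an OLS fit on a monthly PR series:
    PLR = a * t, with t = 12 converting the per-month slope a to per-year."""
    months = np.arange(len(pr_monthly), dtype=float)
    a, _intercept = np.polyfit(months, pr_monthly, 1)
    return a * 12.0

def ape(actual, predicted):
    """Absolute percentage error (APE) in percent: |A - P| / |A| * 100."""
    return abs(actual - predicted) / abs(actual) * 100.0

# Synthetic noise-free 5-year PR series degrading by 0.005 PR units per year.
months = np.arange(60)
pr_series = 0.85 - (0.005 / 12.0) * months
plr = absolute_plr(pr_series)
err = ape(-0.005, plr)
```

On this noise-free series the OLS slope recovers the injected degradation exactly; on real monthly PR data the 5-year minimum length cited in the text is what keeps seasonal oscillations from dominating the fitted slope.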

| Data deletion methods
The influence of missing data on the PR estimates of the test PV module was evaluated first. More specifically, the comparison between the PR of the reference and reconstructed (using the pairwise deletion method) time series exhibited deviations in the range of 0.04-0.46 at missing data rates between 1% and 40% for randomly invalid datasets (Figure 2A). The results further showed that, for a missing data rate of 40%, the PR reconstructed with pairwise deletion (0.517 ± 0.029) was not in agreement with the PR of the reference dataset (0.866 ± 0.034).
Higher spread of the average monthly PR (in the range of 0.01-0.74) and lower deviations from the mean (0.553 ± 0.236 at 40% missing data rate) were observed when the continuous invalid datasets were reconstructed using the pairwise deletion method ( Figure 2B).
The effect of random missing power measurements (Test Case 1a) was successfully mitigated by listwise deletion, even for a 40% missing data rate, since the calculated PR of the invalid datasets (0.866 ± 0.033) agreed with the PR of the reference dataset (0.866 ± 0.034), as shown in Figure 2C, exhibiting an APE of up to only 0.12%.
The application of listwise deletion proved to be an effective method of handling continuous missing power measurements of up to 10% missing data rate by providing a maximum APE of 0.92% on the PR.
Conversely, at higher missing data rates (in the range of 15% to 40%), the application of listwise deletion was not optimal, since APE values of up to 62.01% were obtained on the PR, as shown in Figure 2D. The large deviations observed at missing data rates higher than 10% (and for the worst-case scenario of Test Case 1b, a whole month missing in a yearly dataset) signify the need for other mitigation routines (i.e., the application of data inference techniques to back-fill missing measurements).
Another important outcome of this investigation was that the application of listwise deletion was effective in reconstructing the time series for all invalid datasets of the investigated test cases (i.e., missing irradiance measurements and module temperature measurements in the case of PR TC ) at levels of up to 40% random and 10% continuous missingness.

| Data inference on missing power measurements
The resulting 2,000 invalid time series of Test Case 1b were reconstructed by applying different inference techniques (bootstrap, RF, multiple imputation by PMM, and back-filling using the variant of the SAPM model). 32 The average monthly PR of the reconstructed time series is shown in Figure 3. The time series reconstructed using the SAPM exhibited the closest agreement with the reference dataset, demonstrating the suitability of the SAPM as a data inference model.

F I G U R E 2 Boxplot of the average monthly performance ratio (PR) of the poly-c-Si PV module for (A) random missing power datasets (Test Case 1a) reconstructed using the pairwise deletion method, (B) continuous missing power datasets (Test Case 1b) reconstructed using the pairwise deletion method, (C) random missing power datasets (Test Case 1a) reconstructed using the listwise deletion method, and (D) continuous missing power datasets (Test Case 1b) reconstructed using the listwise deletion method.
Finally, the analysis conducted to investigate the effect of time series reconstruction on the PR calculation showed that data fidelity can be ensured with the application of data inference techniques that treat invalid datasets.

| Data inference on missing irradiance measurements
A benchmarking exercise was carried out by inferring the missing irradiance datasets (Test Case 2b) of the 2,000 invalid time series, and the boxplots depicted in Figure 4 show the average monthly PR values calculated from the reconstructed time series by employing univariate imputation by bootstrap and RF, and multiple imputation by PMM. The corresponding results for missing module temperature measurements are shown in Figure 5, demonstrating the suitability of the SMTM as a data inference model.

| Data deletion and inference verification
The investigation was also applied to the 2,000 invalid datasets of the test PV system installed in Cyprus in order to verify the location, system, and sampling independence of the proposed DQRs. Table 2 shows close agreement with the results obtained when analyzing the time series of the test PV module in Arizona, verifying that the performance of the proposed DQRs methodology is independent of system, location, and sampling method.

| Data integrity effect on PLR analysis
In order to test the sensitivity of the PLR to invalid data points, an analysis was conducted by comparing the reference PLR (PLR ref ) values against the estimates from the 2,000 invalid datasets of Test Case 1b.
The PLR analysis demonstrated that the annual PLR (calculated by applying OLS to the monthly PR time series of the PV system in Cyprus) was sensitive to the amount of continuous invalid power data (Test Case 1b), even at 1% missing data rates (Figure 6).

T A B L E 2 Summary of results of the proposed data quality routines applied for Test Cases 1-3 on invalid datasets from the PV system in Cyprus

T A B L E 3 Data processing and quality verification steps (excerpt)
4. Identification of invalid values
a. Identify outliers by:
a) Physical limits (threshold ranges) 4,45 :
0 W/m² < G_I < 1,300 W/m²
0 W < P_out < 1.02 × AC inverter power rating
0 V < V_A < 1.3 × V_OC of the array
0 A < I_A < 1.5 × I_SC of the array
-40 °C < T_amb < 60 °C
-40 °C < T_mod < 100 °C for open-rack mounted systems
-40 °C < T_mod < 120 °C for roof-mounted and building-integrated systems
0 m/s < W_s < 32 m/s
0% < PR < 110%
b) Comparison of measurements from different/multiple sensors, sky models, and indices (e.g., clear-sky and PV power models) 9
c) Maximum change between successive data points (applicable only for up to 15-min time intervals) 4 :
ΔG_I > 800 W/m²
ΔP_out > 80% of rating
ΔT_amb > 4 °C
ΔT_mod > 4 °C
ΔW_s > 10 m/s
d) Visual inspection of scatter plots 47
e) Statistical and comparative tests (local outlier factor, sigma rule, Hampel identifier, boxplot rule, etc.) 2,7
• Replace outliers by 'NA' values
b. Identify missing values
• Search for 'NA' or 'NaN' values and blank cells
5. Identification of missing data rate
a. Identification of missing data mechanism and rate
• Identify the missing data mechanism (MCAR, MAR, or NMAR) by applying a visualization method (see Appendix A)
• Identify the missing data rate and missingness rate for every recorded field measurement
6. Handling invalid values and dataset reconstruction
a. Invalid data treatment
• Missing data rates lower than 10%: discard the missing period (listwise deletion), or infer the missing measurements when (a) a whole month is missing for a yearly performance analysis or (b) robust degradation and performance loss rate estimates are required
• Missing data rates higher than 10%: if meteorological data are available, infer the missing data using empirical models:
a) Back-fill missing power measurements for c-Si PV modules using the SAPM from the Python library PVLIB 25

Higher APE deviations (in the range of 6.11%-35.01% at missing rates of 5%-10%) provide evidence that the application of listwise data deletion introduces a bias in the calculation of the PLR that grows with increasing missing data rates. Furthermore, data inference using the SAPM and PMM yielded more accurate PLR results when compared to the PLR estimates obtained by listwise deletion. In particular, for a missing data rate of 10%, the maximum APE of the PLR was 35.01% with listwise deletion, while with the SAPM and multiple imputation by PMM, the obtained APE of the PLR was less than 3.03%. For a missing data rate of 40%, the maximum APE of the PLR calculated by listwise deletion was 48.91%, whereas the data inference techniques demonstrated an APE lower than 8.50% (average APE of 3.86%). The SAPM yielded robust PLR estimates; at a missing data rate of 40%, the maximum exhibited APE of the calculated PLR was 5.69% (average APE of 2.82%).

| OUTLINE OF DATA PROCESSING AND QUALITY VERIFICATION FRAMEWORK
The proposed methodology is a pipeline of sequentially structured DQRs that include the application of initial statistics, consistency examination, filtering, detection of invalid values and missing data rates, treatment of invalid data, and aggregation at different granularities.
The initial step (Step 1) uses data statistics to determine the recording interval (time between two consecutive time records) and the reporting period. For PV performance and reliability analyses, the reporting period should be long enough to provide representative PV operational data and ambient conditions (i.e., a minimum of 1 year of continuous monitoring for outdoor PV performance evaluation). 45 The fidelity of the time series is then examined to find timestamp gaps, repetitive entries, duplicate records, and synchronization issues between meteorological and electrical data (Step 2). After removing the repetitive and duplicate timestamp records, the time series is verified and resampled against known (i.e., simulated) timestamp series. In case of mismatches between timestamps of different dataloggers (e.g., weather and electrical measurements are logged separately), a data time series synchronization is performed. A daylight filter (G I > 20 W/m 2 ) is then applied to the dataset (Step 3). 2 Alternative daylight filters may include the use of time or sun elevation filters.
The missing values are then identified (Step 4) by searching for NA or NaN values within the examined dataset. Outliers are detected by imposing physical limitations on the recorded data, applying variation limits between successive data points, and using statistical and comparative methods (e.g., Local Outlier Factor, Sigma rule, Hampel identifier, boxplot rule, and rolling mean). 4,7,46 The detected outlying values are then replaced by NA values. At this point, the missing data rate is calculated. The next step (Step 5) is to identify the missing data mechanism (MCAR, MAR, and NMAR) by applying a suitable data visualization method (e.g., heatmaps, aggregation, scatter, and spine plots). Once the invalid data points are detected, the dataset is reconstructed based on the calculated missing data rate and mechanism (Step 6). 15 Thus, missing values are either treated by data deletion (pairwise or listwise) or by inference techniques (e.g., empirical models, multiple and univariate data imputation) in the form of dataset reconstruction routines.
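One of the statistical detectors named above, the Hampel identifier, can be sketched as a rolling median/MAD filter that replaces flagged points with NaN so they join the missing-data treatment; the window length and threshold are illustrative defaults.

```python
import numpy as np
import pandas as pd

def hampel_filter(series, window=5, n_sigmas=3.0):
    """Hampel identifier: flag points deviating from the rolling median
    by more than n_sigmas * 1.4826 * rolling MAD (1.4826 scales the MAD
    to a Gaussian sigma), and replace them with NaN."""
    med = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - med).abs().rolling(window, center=True,
                                       min_periods=1).median()
    outlier = (series - med).abs() > n_sigmas * 1.4826 * mad
    return series.mask(outlier)

s = pd.Series([10.0, 11.0, 10.5, 500.0, 10.8, 11.2, 10.9])
cleaned = hampel_filter(s)  # only the 500.0 spike becomes NaN
```

Replacing (rather than deleting) keeps the timestamp grid intact, which matters for the reconstruction routines that follow.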
In the case of missing data rates lower than 10%, the missing periods should be discarded from the dataset (listwise deletion).
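The 10% decision rule can be sketched as follows; the row-wise definition of the missing data rate and the function names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missing_rate(df):
    """Fraction of rows containing at least one missing value."""
    return df.isna().any(axis=1).mean()

def listwise_delete(df, threshold=0.10):
    """Apply listwise deletion only when the missing data rate is at or
    below the threshold; otherwise leave the dataset for inference."""
    rate = missing_rate(df)
    if rate <= threshold:
        return df.dropna(), rate
    return df, rate

df = pd.DataFrame({"G_I": [800.0, 900.0, np.nan] + [1000.0] * 17,
                   "P_A": [180.0] * 20})
treated, rate = listwise_delete(df)  # rate = 0.05, so the row is dropped
```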

T A B L E 3 (Continued)
• If meteorological and satellite data are not available, impute the missing values using univariate data imputation techniques b: a) Impute missing power measurements using the bootstrapping univariate data imputation technique; b) Impute missing module temperature measurements using the bootstrapping univariate data imputation technique
• If satellite data are available, infer the missing meteorological measurements using satellite observations c
• If electrical data are available, impute the missing data using multiple imputation techniques: a) Impute missing meteorological measurements using multiple imputation by PMM

Once the dataset is treated and reconstructed, aggregation is applied depending on the final use (Step 7). Final data statistics are then recorded based on the reconstructed dataset (Step 8). Table 3 summarizes the data processing and quality verification steps, with all quantifiable metrics and specifications of each DQR in the form of guidelines.
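The bootstrapping univariate imputation named above can be sketched as sampling with replacement from the observed values of the same variable; the seed and function name are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def bootstrap_impute(series, seed=0):
    """Univariate bootstrapping imputation: replace each missing value
    with a draw (with replacement) from the observed values, preserving
    the empirical distribution of the variable."""
    rng = np.random.default_rng(seed)
    out = series.copy()
    observed = out.dropna().to_numpy()
    n_missing = out.isna().sum()
    out.loc[out.isna()] = rng.choice(observed, size=n_missing)
    return out

t_mod = pd.Series([35.1, np.nan, 41.7, 39.2, np.nan, 44.0])
filled = bootstrap_impute(t_mod)  # gaps filled from observed values
```

Because the imputed values are resampled from the data itself, the marginal distribution of the variable is preserved, which is the main argument for bootstrapping when no auxiliary (meteorological or electrical) data exist.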

| CONCLUSIONS
A unified methodology for PV data processing, quality verification, and reconstruction is presented in an attempt to reduce bias and enable reproducible PV performance, degradation, and PLR analyses.
The methodology is a pipeline of sequentially structured DQRs that include the application of initial data statistics, consistency examination, filtering, detection of invalid values and missing data rates, treatment of invalid data, and aggregation at different granularities.

A.1 | DQRs visualization
A dataset from a PV system installed at the OTF of the UCY was utilized to visualize the steps of the data processing and quality verification methodology.
Step 1: The recording interval is 15 min, and the reporting period is 365 days, which satisfies the minimum reporting period requirement of 1 year for PV performance assessment. In addition, the dataset consisted of 35,040 rows (records) and 8 columns (recorded field measurements): Date/Time, in-plane irradiance (G I), ambient air temperature (T amb), module temperature (T mod), array current (I A), voltage (V A), and power (P A) at the DC side, and AC output power (P out).
Step 2: The dataset was examined for consistency; 5 repetitive/duplicate and NA timestamp (Date/Time) records were identified. After removing the repetitive and duplicate timestamp records, the time series was verified against a known (simulated) timestamp series.
Step 3: An irradiance filter (G I > 20 W/m²) was then applied to the dataset to yield daylight time series. As a result, the number of rows was further reduced to 15,820.
Step 4: Missing values were identified by searching for NA values in the dataset, while the outlying data points were detected by imposing range limits on the data and visually inspecting scatter plots. The developed DQRs identified 61 outlying (erroneous) in-plane irradiance and AC output power data points through the application of physical limitations. Visual inspection of the irradiance-power diagnostic plot ( Figure A1) was deemed sufficient for detecting outlying values.
Observations close to the irradiance-power linear relationship line are assumed to be normal, while observations far from this line are considered as outliers.
Similarly, the dataset was also examined for global outliers using automated approaches (i.e., Sigma rule method, Hampel identifier, or standard boxplot rule). The outlier detection routines (ODRs) identified 31 outlying values for the recorded AC power measurements (the ODRs results for the AC and DC power measurements are depicted in Figure A2).
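The standard boxplot rule mentioned above can be sketched as flagging values outside the interquartile fences; the fence factor k = 1.5 is the conventional default, and the sample values are illustrative.

```python
import pandas as pd

def boxplot_rule(series, k=1.5):
    """Standard boxplot rule: flag values outside
    [Q1 - k*IQR, Q3 + k*IQR] as global outliers."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

p_out = pd.Series([180.0, 185.0, 178.0, 182.0, 990.0, 179.0])
flags = boxplot_rule(p_out)  # only the 990.0 reading is flagged
```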
A comparison of measurements from two identical pyranometers installed nearby was performed to assess the working condition of the irradiance sensor. The irradiance measurements acquired from the pyranometers were plotted on the scatter diagram of Figure A3, showing their linear relationship. The extracted coefficient of determination (R²) was 0.998, indicating that the system irradiance sensor was operating properly.
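The R² cross-check between the two sensors can be reproduced in a few lines; the irradiance samples below are illustrative, not the paper's data.

```python
import numpy as np

def r_squared(x, y):
    """Coefficient of determination for the linear fit of y on x,
    used to cross-check two nearby irradiance sensors."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

g_ref = np.array([120.0, 350.0, 610.0, 840.0, 1010.0])  # reference sensor
g_sys = np.array([118.0, 352.0, 606.0, 845.0, 1012.0])  # system sensor
r2 = r_squared(g_ref, g_sys)  # close to 1: sensor agrees with reference
```

An R² close to 1 rules out drift or soiling of the system sensor relative to the reference, which is what the paper's value of 0.998 establishes.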
The detected outlying values were replaced by NA.
Step 5: The missing data rate was calculated. The portion of missingness for each recorded measurement is depicted in Figure A4 (where black color indicates missing values while grey color represents available measurements).
Additionally, in order to identify the missing data mechanism, an aggregation plot ( Figure A5) was used to visualize the data and expose the relationship between available and missing data points. Figure A5A shows a bar for the recorded measurements where the bar height corresponds to the proportion of missing values. Figure A5B shows the missingness pattern for the variables. Closer inspection of Figure A5B reveals no links between the missing values for the acquired measurements. Thus, the missing data mechanism is MCAR.
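The quantities shown in Figures A4 and A5 can be computed directly: per-variable missing fractions and the distinct row-wise missingness patterns (the tabular analogue of an aggregation plot). The column names and sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def missingness_summary(df):
    """Per-variable missing fraction plus the distinct row-wise
    missingness patterns and their frequencies."""
    fractions = df.isna().mean()          # share of NA per column
    patterns = df.isna().value_counts()   # one entry per unique pattern
    return fractions, patterns

df = pd.DataFrame({
    "G_I":   [800.0, np.nan, 600.0, 900.0],
    "T_mod": [45.0,  50.0,   np.nan, 40.0],
    "P_A":   [180.0, 200.0,  140.0,  210.0],
})
fractions, patterns = missingness_summary(df)
```

If the missing values of different variables never co-occur in the same pattern (as in Figure A5B), there is no visible link between them, which supports the MCAR classification.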
Step 6: The DQRs provide algorithms for the treatment of invalid values and dataset reconstruction. Since the missing data rate was less than 10%, the identified invalid data points were treated by listwise deletion.
Abbreviations: G I, in-plane irradiance; I A, array current; NA, not available; P A, array power (DC); P out, AC output power; T amb, ambient air temperature; T mod, module temperature; V A, array voltage.