Short-term load forecasting at electric vehicle charging sites using a multivariate multi-step long short-term memory: A case study from Finland

This study assesses the performance of a multivariate multi-step charging load prediction approach based on the long short-term memory (LSTM) and commercial charging data. The major contribution of this study is to provide a comparison of load prediction between various types of charging sites. Real charging data from shopping centres, residential, public, and workplace charging sites are gathered. Altogether, the data consists of 50,504 charging events measured at 37 different charging sites in Finland between January 2019 and January 2020. A forecast of the aggregated charging load is performed in 15-min resolution for each type of charging site. The second contribution of the work is the extended short-term forecast horizon. A multi-step prediction of either four (i.e., one hour) or 96 (i.e., 24 h) time steps is carried out, enabling a comparison of both horizons. The findings reveal that all charging sites exhibit distinct charging characteristics, which affects the forecasting accuracy and suggests a differentiated analysis of the different charging categories. Furthermore, the results indicate that the forecasting accuracy strongly correlates with the forecast horizon. The 4-time step

is presented, which considers important determinants of the load curve as multivariate inputs and allows the forecast of the whole prediction horizon in a single computation.
The subsequent parts of this paper are organised as follows. Section 2 presents the state of the art in recent studies. Section 3 deals with the basics of LSTMs and the origin of the data used in this work. Section 4 introduces the methodology of the charging load forecast. Section 5 presents the results, which are further discussed in Section 6. Conclusions and directions for future research are provided in Section 7.

| RELATED RESEARCH
In this section, a literature review of the state of the art in the field of EV short-term charging load forecasting is conducted to identify the shortcomings in previous research and highlight the contributions of this work.

| Linear methods
Linear forecasting approaches use linear functions to model time series behaviour. Popular methods include autoregressive, moving average, autoregressive moving average, and autoregressive integrated moving average (ARIMA) processes. Concerning the prediction of the EV charging load, ARIMA models in different versions are particularly popular among all linear approaches. A plain ARIMA model is introduced in [15] to forecast the day-ahead EV charging demand in 15-minute (min) resolution. The prediction error is lowered with larger aggregation. Another work [16] examines the applicability of an ARIMA model to predict the EV charging demand of the next 24 hours (h). The proposed decoupled forecaster, which independently predicts the charging demand and the conventional electrical load, significantly reduces the prediction error compared to an integrated forecaster.
The works in [17][18][19] present more sophisticated versions of the ARIMA model. While paper [17] introduces a fractional autoregressive integrated moving average (FARIMA) model, article [18] proposes several seasonal autoregressive integrated moving average (SARIMA) approaches for EV charging load forecasting. The FARIMA model outperforms various ARIMA models for all forecast horizons up to 120 min. The SARIMA models achieve superior results compared to a persistence forecast and a modified pattern sequence-based forecast (MPSF). Similar to the work in [18], study [19] introduces two seasonal autoregressive integrated moving average models with exogenous variables (SARIMAX). The SARIMAX models outperform two machine learning algorithms, a random forest (RF) and a gradient boosted regression tree (GBRT) model, for different forecast horizons.

| Non-linear methods
Contrary to the results in [19], several studies indicate a superior performance of non-linear methods for EV charging load predictions. Compared to their linear counterparts, non-linear approaches rely on non-linear functions to capture more complex time series behaviour. Study [20] applies four different algorithms, namely the RF, support vector regression (SVR), time weighted dot product based nearest neighbor (TWDPNN), and MPSF, for 24-h charging load forecasting. The evaluation is based on two different datasets: charging and station measurements. The most accurate predictions based on the charging measurement are achieved by using the TWDPNN approach, whereas the MPSF provides the most precise results for the station measurement dataset. An approved RF algorithm is applied to predict the EV charging load of single and grouped EV charging stations in [21]. The analysis reveals that, in contrast to an aggregated prediction, more precise results are obtained when performing individual forecasts.
Moreover, considerable attention has been paid to artificial neural networks (ANNs) lately. The predictive performance of different neural networks (NN) is frequently addressed in the literature [22][23][24][25]. Article [22] performs a 24-h charging load forecast based on historical driving patterns. Three different ANNs (a simple ANN, a rough artificial neural network (R-ANN), and a recurrent rough artificial neural network (RR-ANN)) are deployed and compared to a Monte Carlo simulation (MCS). RR-ANNs generate the most accurate predictions. In [23], a novel reinforcement learning technique is introduced to forecast the EV charging load under three scenarios: uncoordinated, coordinated, and smart charging. The proposed Q-learning technique increases the prediction accuracy compared to a simple ANN and a recurrent neural network (RNN). The work carried out in [24] studies the short-term charging load predictability of four different types of ANNs: a deep neural network (DNN), an RNN, an LSTM, and a gated recurrent unit (GRU). GRUs with one hidden layer achieve the most accurate prediction results. Study [25] provides a further evaluation of different ANN approaches for super short-term charging load forecasting. The predictive performance of a simple ANN, an RNN, an LSTM, a bidirectional LSTM, a GRU, and a stacked auto-encoder (SAE) is analysed. The LSTM yields the best performance in both case studies: charging in public and at commercial buildings.
Similar to the results in [25], articles [26][27][28] demonstrate the superior performance of LSTMs for EV load forecasting. The study in [26] develops an LSTM for single-step predictions. The analysis demonstrates that the LSTM outperforms a simple ANN for forecast horizons of both 15 and 30 min, and that the prediction error diminishes with decreasing forecast horizon. Another LSTM forecasting approach is presented in [27]. The work considers four types of EVs: private and commercial EVs, electric buses, and electric taxis. Based on the MCS, the LSTM outperforms a backpropagation (BP) network and SVR. Paper [28] proposes an LSTM-based approach with hybrid classification to forecast EV travel behaviour and the associated electrical demand. Compared to copula, quasi-MCS, and MCS, the novel approach improves the forecast accuracy, thus resulting in lower EV aggregator costs. The research in [29] presents a hybrid model of extreme gradient boosting and LSTM for domestic charging station load forecasting. Multivariate inputs are used to predict the EV consumption in the next period. Study [30] proposes a hybrid approach of a deep belief network (DBN) and the LSTM network for short-term EV load prediction. The LSTM-DBN achieves superior forecasting results compared to a single DBN and a single LSTM.

| Hybrid methods
Similar to studies [29,30], other hybrid approaches as compounds of different models have gained great popularity in recent studies. Paper [33] combines a deterministic bottom-up approach with a probabilistic RF model for the short-term forecast of the aggregated load of 46 privately owned EVs in Austin, Texas. The proposed model shows superior performance compared to a persistence forecast, but similar to a GBRT model. The study in [31] combines a least squares support vector machine (LSSVM), fuzzy clustering (FC), and a wolf pack algorithm (WPA) to predict the load of electric bus charging stations. The FC-WPA-LSSVM model outperforms a WPA-LSSVM, a regular LSSVM and a BP-NN. Article [32] proposes another hybrid approach, combining a lion algorithm with the niche immune (NILA) and a convolutional neural network (CNN), for short-term EV charging load forecasting. Taking into account the multivariate inputs, the NILA-CNN shows superior performance to a single CNN, a lion algorithm, CNN, and an SVM. The work in [34] introduces a hybrid model, based on wavelet decomposition (WT), convolutional neural networks, and probabilistic queuing, for day-ahead charging load predictions. The WT-CNN outperforms a BP-NN, an SAE, an RNN, an SVR, a time-delayed neural network (TDNN), and a growing DBN.

| Research gaps and contributions of this work
Table 1 outlines the main findings of the literature review. All in all, three major limitations can be summarised. The studies either lack real charging data [16,17,22,23,27] or the data employed is outdated, making it difficult to transfer the results to today's strongly changed situation in terms of a larger number of EVs, different charging powers, and the altered utilization of EVs [15, 18-20, 33, 34]. Moreover, most of the works only perform single-step predictions [16, 20, 21, 23-27, 29, 31, 32], which limits the scope of practical applications due to the short forecast horizon. Lastly, although Ref. [25] indicates that the achievable prediction accuracy may depend on the type of charging, no attention has been paid to the analysis and comparison of different charging sites.
This work contributes to the identified research gaps in the following ways:
• Introducing a novel multivariate, multi-step LSTM application, which significantly extends the existing short-term forecast horizon by providing a 4-time step (ts) or 96-ts prediction at once,
• Comparing and assessing the forecasting accuracy at different categories of charging sites which exhibit different charging characteristics, and
• Using a large amount of high-quality charging data from various commercial charging sites.

| DESCRIPTION OF THE DATASET AND LSTM
This section deals with both the characteristics of the real charging data used in this work and the LSTM fundamentals.

| Introduction of the dataset
This study is based on real charging data of 50,504 slow charging events with a maximum capacity of 22 kW, measured at 37 different charging sites in Finland between January 2019 and January 2020. While the smallest charging site contains fewer than 10 charging stations, the largest one features nearly 300. An overview of the characteristics of the charging data is given in Table 2. The charging events are clustered according to the four categories of charging sites selected for this work. The data for the first category, the charging at shopping centres (SCc), stems from one charging site. The charging events of 21 sites are assigned to the second category named residential charging (REc), representing the charging at home. The third category, public charging (PUc), covers the charging at car parks that can be used by all EV drivers. The last category, workplace charging (WOc), refers to EV charging at company premises that can only be used by the staff working in the vicinity. Eight and seven sites are assigned to PUc and WOc, respectively.

| Long short-term memory
The charging load of EVs is typically subject to strong time dependencies as the load corresponds to cyclical and seasonal patterns. The LSTM is able to overcome the vanishing or exploding gradient problem of RNNs, and to learn both long-term and short-term dependencies, thus making it an ideal choice for this work [35]. The ability to learn temporal correlations stems from the unique structure of an LSTM cell, illustrated in Figure 1. The LSTM at a single ts $t$ consists of three gate units, namely the forget gate $f_t$, input gate $i_t$, and output gate $o_t$.
The functionality of the LSTM cell can be expressed mathematically by the following formulas [36].
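For orientation, a common textbook formulation of the LSTM cell is sketched here. The symbols $x_t$ (input), $h_t$ (hidden state), $c_t$ (cell state), $\tilde{c}_t$ (candidate state), and the weights $W$, $U$ and biases $b$ are assumed notation and may differ from the notation used in [36].

```latex
\begin{align}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```

Here $\sigma$ is the logistic sigmoid and $\odot$ the element-wise product. The forget gate scales the previous cell state, the input gate scales the candidate state, and the output gate controls how much of the cell state is exposed as the hidden state.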

| METHODOLOGY
The methodology of the proposed multivariate multi-step LSTM is shown in Figure 2 and consists of three main stages: data preprocessing, LSTM implementation and training, and LSTM forecasting.

| Data preprocessing
The following paragraph details the data preprocessing.

| Aggregated charging load time series generation
In the first step, the event-based charging data is converted into a time series of aggregated charging load values in 15-min resolution under the assumption of a constant charging power.

FIGURE 1 Architecture of a long short-term memory cell at a single time step t.
The data contains both the plug-in and power-to-zero timestamps and the original amount of energy charged, $energy_{or}$, for each charging event. First, the original charging time, $t_{or}$, in min is derived as the difference between both timestamps. If the power-to-zero timestamp is missing due to measurement errors, a nominal charging power of 1.8 kW is assumed, as most of the charging events indicate charging at low powers. In that case, the original charging time is calculated by dividing the original amount of energy charged by the assumed charging power. Since the charging times might change due to the analysis interval of 15 min, the original charging time is then adjusted. If the original charging time is less than 15 min, it is rounded up to a full 15-min interval. Otherwise, a modulo operation, mod, is performed and the modified charging time, $t_{new}$, is calculated according to (7).
Furthermore, the modified charging load, $load_{new}$, of each charging process is calculated according to (8) under the premise of a constant amount of energy charged. Lastly, by aggregating the calculated load of each charging event, the charging load time series is generated, spanning one year with 96 load values for each day.
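The conversion from charging events to an aggregated load series can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the exact form of (7) is not reproduced in this text, so the sketch assumes that the modulo operation extends the charging time to the next full 15-min interval, and that each event is aligned to the slot containing its plug-in time. All function names are hypothetical.

```python
SLOT_MIN = 15      # analysis interval in minutes
NOMINAL_KW = 1.8   # fallback power when the power-to-zero timestamp is missing


def original_minutes(plug_in_min, power_zero_min, energy_kwh):
    """Original charging time t_or in minutes."""
    if power_zero_min is None:
        # Missing timestamp: derive the duration from energy and nominal power.
        return energy_kwh / NOMINAL_KW * 60.0
    return power_zero_min - plug_in_min


def adjusted_minutes(t_or):
    """Modified charging time t_new (ASSUMED rule: round up to a full slot)."""
    if t_or < SLOT_MIN:
        return SLOT_MIN
    remainder = t_or % SLOT_MIN
    return t_or if remainder == 0 else t_or + (SLOT_MIN - remainder)


def constant_load_kw(energy_kwh, t_new):
    """Modified load load_new: same energy spread evenly over t_new."""
    return energy_kwh / (t_new / 60.0)


def aggregate(events, n_slots):
    """Sum the constant load of every event into a 15-min time series.

    events: iterable of (plug_in_min, power_zero_min or None, energy_kwh),
    with times given in minutes since the start of the series.
    """
    series = [0.0] * n_slots
    for plug_in, power_zero, energy in events:
        t_new = adjusted_minutes(original_minutes(plug_in, power_zero, energy))
        load = constant_load_kw(energy, t_new)
        first = int(plug_in // SLOT_MIN)            # ASSUMED slot alignment
        last = min(first + int(round(t_new / SLOT_MIN)), n_slots)
        for slot in range(first, last):
            series[slot] += load
    return series
```

Because the energy of each event is held constant, the sum of the series multiplied by 0.25 h recovers the total energy charged, which is a convenient sanity check on any rounding rule chosen.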

| Multivariate input selection
To determine the important features to be fed to the LSTM as multivariate inputs to support the detection of interrelationships, an analysis of the characteristics of the charging load is carried out. Figure 3 illustrates the development of the aggregated charging load over the course of a year at all four categories of charging sites and reveals two particularities. First, an increase of the aggregated load over the course of the year can be observed for all charging sites. Second, holiday periods have a strong impact on REc, PUc, and WOc. The Finnish summer holidays in July and the winter holidays in December noticeably reduce the load. Both findings support the selection of a month indicator as a feature. However, since the data only covers one year, a month feature would lead to unknown indicators during the data split and is, thus, omitted.
Figure 4 provides a detailed analysis of the charging sites by visualising the charging load for three different weeks of the year. When comparing the load curve of the first week of February (green) with that of the end of November (grey), the increase in charging load over the course of a year is evident. Moreover, a clear impact of the type of day and time of the day can be seen for REc, PUc, and WOc. The charging load varies significantly between weekdays and weekends, and the time of the peak load follows a clear pattern. SCc is the only charging site that exhibits a highly random load pattern. When looking at the purple plot, the influence of public holidays on the load is evident as well. For all charging sites, the load is considerably reduced during the three Christmas holidays. Consequently, to help the LSTM to detect interrelationships, each load value is linked to a quarter hour indicator, type of day indicator, and public holiday indicator, collectively representing the multivariate inputs to the LSTM.
FIGURE 2 Methodology of the LSTM EV charging load forecast.

| Scaling and encoding of data
To avoid data leakage, the time series is split into training and test sets with a 90% to 10% ratio prior to the scaling and encoding.Furthermore, 20% of the training data is used to validate the model performance during training, resulting in a 72% training, 18% validation, and 10% test split.
To accelerate network convergence and allow the LSTM to process categorical features, the load is normalised, and the indicators are encoded. The commonly used min-max scaling method is used to normalise the load values [37]. The popular one-hot encoding technique is used to encode the binary character of the public holiday indicator [38]. For the quarter hour and type of day indicators, sine/cosine encoding is tested as well to allow a better representation of the cyclic nature of both indicators [39,40].
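The split and scaling steps can be sketched as below. This is an illustrative sketch only: it assumes the validation part is the chronologically last 20% of the training data, and the helper names (`chronological_split`, `MinMax`, `cyclic`) are hypothetical. Fitting the scaler on the training part alone is what prevents information from the test period leaking into the inputs.

```python
import numpy as np


def chronological_split(series, test_frac=0.10, val_frac=0.20):
    """Split a time series into train/validation/test without shuffling."""
    n = len(series)
    n_test = int(round(n * test_frac))
    n_trainval = n - n_test
    n_val = int(round(n_trainval * val_frac))   # 20% of the 90% -> 18%
    n_train = n_trainval - n_val                # remaining 72%
    return series[:n_train], series[n_train:n_trainval], series[n_trainval:]


class MinMax:
    """Min-max scaler whose parameters come from the training data only."""

    def fit(self, x):
        self.lo, self.hi = float(np.min(x)), float(np.max(x))
        return self

    def transform(self, x):
        return (np.asarray(x, dtype=float) - self.lo) / (self.hi - self.lo)

    def inverse(self, x):
        # Used after forecasting to convert normalised values back to kW.
        return np.asarray(x, dtype=float) * (self.hi - self.lo) + self.lo


def cyclic(index, period):
    """Sine/cosine encoding of a cyclic indicator (e.g. quarter hour 0..95)."""
    angle = 2.0 * np.pi * np.asarray(index) / period
    return np.sin(angle), np.cos(angle)
```

With the sine/cosine pair, quarter hour 95 and quarter hour 0 end up close together in feature space, which a plain integer index would not achieve.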

| Supervised learning framing and reshaping of data
Finally, the data is converted into a supervised learning problem and formatted such that the shape meets the specific LSTM requirements. In the first step, the normalised and encoded data is concatenated into a single array for every 15-min ts. Using one-hot encoding, each ts possesses 106 dimensions of feature space; this number is reduced to seven when using the sine/cosine encoding.
Subsequently, the data is framed in a supervised learning manner by separating the data into input and output sequences. In this study, two approaches are examined: the stateless and the stateful mode. The sliding window approach is applied for the stateless mode. In the stateful mode, the data is prepared in such a way that the fixed input sequences adjoin each other without overlapping. In both cases, the input data contains the load value and all indicators for each ts, whereas the output only contains the load to be compared with the predicted load.
In the last step, the data is reshaped into the required three-dimensional shape, defining the number of samples, the number of input ts, and the number of features for each ts.
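The two feature-space sizes can be reproduced with a small sketch. The breakdown used here is an assumption that is consistent with the reported numbers but is not spelled out in the text: one load value, 96 quarter-hour classes, seven type-of-day classes, and two public-holiday classes give 1 + 96 + 7 + 2 = 106, while replacing the first two indicators by sine/cosine pairs gives 1 + 2 + 2 + 2 = 7. The function names are hypothetical.

```python
import numpy as np


def one_hot(index, size):
    """One-hot encode an integer index array into `size` columns."""
    index = np.asarray(index)
    out = np.zeros((len(index), size))
    out[np.arange(len(index)), index] = 1.0
    return out


def cyclic(index, period):
    """Sine/cosine encoding returned as two columns."""
    angle = 2.0 * np.pi * np.asarray(index) / period
    return np.stack([np.sin(angle), np.cos(angle)], axis=1)


def build_features(load, quarter, weekday, holiday, encoding="one_hot"):
    """Concatenate load and indicators into one row per 15-min time step."""
    load = np.asarray(load, dtype=float).reshape(-1, 1)
    hol = one_hot(holiday, 2)                       # binary holiday indicator
    if encoding == "one_hot":
        parts = [load, one_hot(quarter, 96), one_hot(weekday, 7), hol]
    else:
        parts = [load, cyclic(quarter, 96), cyclic(weekday, 7), hol]
    return np.concatenate(parts, axis=1)
```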
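The framing step can be illustrated with a single windowing helper. In this sketch, which uses hypothetical names, a stride of one yields the overlapping sliding windows of the stateless mode, while a stride equal to the number of input ts yields adjoining, non-overlapping input sequences as described for the stateful mode. The result already has the three-dimensional shape (samples, input ts, features).

```python
import numpy as np


def make_windows(features, load, t_in, t_out, stride):
    """Frame a multivariate series as (input sequence, output sequence) pairs.

    features: array of shape (time steps, n features), load included
    load:     array of shape (time steps,), the target variable
    """
    X, y = [], []
    last_start = len(features) - t_in - t_out
    for start in range(0, last_start + 1, stride):
        X.append(features[start:start + t_in])                # all features
        y.append(load[start + t_in:start + t_in + t_out])     # load only
    return np.stack(X), np.stack(y)


# Stateless mode: sliding window, e.g. 16 input ts for a 4-ts forecast.
# X, y = make_windows(features, load, t_in=16, t_out=4, stride=1)
# Stateful mode: adjoining inputs, equal number of input and output ts.
# X, y = make_windows(features, load, t_in=4, t_out=4, stride=4)
```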

| Long short-term memory implementation and training
The most suitable LSTM configuration is selected in two steps. First, the LSTM is trained with initially selected hyperparameters (hp) on different approaches: the two encoding techniques, the two modes, and a varying number of input ts.
In stateless mode, 4, 8, 16, and 32 ts are tested as inputs for the 4-ts forecast, and 96 and 192 ts for the 96-ts prediction. In the stateful mode, the same number of input and output ts is applied. Subsequently, based on the minimal validation loss, the most suitable approach undergoes tuning.

| Training and hyperparameter tuning implementation
The LSTM is implemented as illustrated in Figure 5, using the Sequential Model in Keras. While indices $n$ denote the number of input features, indices $t_{in}$ depict the number of input ts for each sample. Indices $t_{out}$ define the number of predicted load values subsequent to the input sequence.
Table 3 provides an overview of the hp selected for initial training and hp tuning. The tuning, based on a random search, is performed in hyperopt, a popular library for conducting hp optimisation in Python. A detailed description of hyperopt is given in [40]. The number of evaluations is set to 50 and the seeding is set to one to ensure the comparability between the different charging sites and reproducibility of the results.
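The principle of a seeded random search with 50 evaluations can be sketched in plain Python. This is a dependency-free stand-in for illustration and does not use hyperopt itself. The search space below is hypothetical, since the values of Table 3 are not reproduced in this text, and `toy_objective` is a placeholder for "train the LSTM with this configuration and return the minimum validation MSE".

```python
import random

# HYPOTHETICAL search space; the actual values are listed in Table 3.
SPACE = {
    "units": [16, 32, 64, 128],
    "hidden_layers": [1, 2],
    "dropout": [0.0, 0.2, 0.5],
    "learning_rate": [0.01, 0.001, 0.0001],
    "batch_size": [32, 64, 128],
}


def random_search(objective, space, n_evals=50, seed=1):
    """Draw n_evals random configurations and keep the one with lowest loss."""
    rng = random.Random(seed)          # fixed seed -> reproducible draws
    best_cfg, best_loss = None, float("inf")
    for _ in range(n_evals):
        cfg = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(cfg)
        if loss < best_loss:
            best_cfg, best_loss = cfg, loss
    return best_cfg, best_loss


def toy_objective(cfg):
    """Placeholder loss; a real objective would train and validate the LSTM."""
    return cfg["dropout"] + cfg["learning_rate"] + 1.0 / cfg["units"]


best_cfg, best_loss = random_search(toy_objective, SPACE)
```

Fixing the seed means every charging site sees the same sequence of sampled configurations, which is what makes the tuning results comparable across sites.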

| LSTM training process
The training proceeds as follows. The training and validation data is shown to the LSTM in the input layer. Both data sets consist of multiple samples, each comprising the input and output sequences, which are successively processed by the hidden layer(s). For each sample, the concatenated features of each ts form the current input for the LSTM cell. In the stateless mode, the final state of each processed batch is removed. In the stateful mode, however, the final state is provided as the initial state for each sample in the next batch. The state is manually reset after each epoch as each epoch contains the same time series data. After the input data has been processed by either one or two hidden layers, the final state of the last cell in the last hidden layer is passed to a dense layer which outputs the number of ts to be predicted for each input sequence. Once a certain number of samples, defined by the batch size, is processed, the predicted values are compared with the desired real load values to calculate the training and validation loss. Based on the mean squared error (MSE) during training, the weights are then modified after each batch by the Adam optimiser using backpropagation through time. This process is repeated until the maximum number of epochs is met or the training is terminated prematurely by the implemented early stopping callback.
FIGURE 5 Architecture and training process of the long short-term memory. MSE, mean squared error.
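A stateless variant of the described architecture might look as follows in Keras. This is a rough sketch under assumptions, not the configuration used in the study: the function name, the use of the `dropout` argument of the LSTM layer, and the early stopping patience of 10 are illustrative choices, and the stateful mode (which needs a fixed batch size and manual state resets) is not covered.

```python
from tensorflow import keras


def build_lstm(t_in, n_features, t_out, units=128, n_layers=1,
               dropout=0.0, learning_rate=0.001):
    """Stacked LSTM followed by a dense layer that emits all t_out steps."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(t_in, n_features)))
    for layer in range(n_layers):
        is_last = layer == n_layers - 1
        # Only the last LSTM layer returns just its final state.
        model.add(keras.layers.LSTM(units, dropout=dropout,
                                    return_sequences=not is_last))
    model.add(keras.layers.Dense(t_out))   # whole horizon in one computation
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="mse")
    return model


# ASSUMED patience; training stops once the validation loss stops improving.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)

# Typical call, with X_* and y_* shaped as (samples, t_in, n) and (samples, t_out):
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=100, batch_size=64, callbacks=[early_stop])
```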

| LSTM forecasting
After the completion of the random search, the LSTM with the optimal hp configuration performs the load forecast.

| Implementation of the forecast
The forecast is carried out with a batch size of one, to avoid errors in the prediction process caused by the varying batch sizes during training. For every input sequence of the test data, the model predicts the subsequent four or 96 ts. Afterwards, both the normalised real and predicted load are converted to load values in kW. The inverse normalisation is performed with the initial parameters of the min-max scaler that have been saved during the preprocessing stage for this purpose.

| Metrics
Six different error metrics are employed to evaluate the model performance. The mean absolute error (MAE) in kW, calculated according to (9), measures the precision of the forecast by averaging the error between the predicted and real load, and is selected due to its simple comprehensibility. Here, $N$ denotes the number of forecasts, $load_t$ the real charging load at time $t$, and $\widehat{load}_t$ the predicted one. However, the MAE is scale-dependent, which implies the need for additional error metrics to allow a better comparison of the different charging sites. Given its scale-independence, the popular mean absolute percentage error (MAPE) would provide an easily interpretable error metric. However, the charging load time series exhibit a charging load of zero at numerous points in time. For this reason, MAPE cannot be used for overall comparison, as the calculation is based on the division of the error by the true load value at each ts. To overcome these difficulties, two variants of the normalised mean absolute error (NMAE) are used as shown in Equations (10) and (11). The MAE is normalised by the mean charging load and by the difference between the maximum load, $load_{max}$, and the minimum load, $load_{min}$, respectively.
For the 96-ts forecast, three additional metrics are used to better assess how well the daily peak load can be predicted, both in terms of temporal occurrence and magnitude of the peak load. The timing and magnitude of the peak load are especially important for assessing possible congestions in the distribution network. The average peak deviation, pdev, in kW between the actual daily peak load, $load_{dm}$, and the predicted daily peak load, $\widehat{load}_{dm}$, is calculated according to Equation (12). While $D$ specifies the number of predicted days, $dm$ denotes the daily peak. Likewise, the MAPE in % is calculated according to Equation (13). Finally, Equation (14) specifies the average time deviation, tdev, in min separating the real and predicted peak load. Variable slot corresponds to the respective 15-min interval of the real and the predicted peak load.
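The three general metrics follow directly from the verbal definitions above and can be sketched as below. The function names are hypothetical, and the sketch assumes that the mean and the range used for normalisation are taken from the real load of the evaluated period.

```python
import numpy as np


def mae(real, pred):
    """Mean absolute error in kW."""
    real, pred = np.asarray(real, dtype=float), np.asarray(pred, dtype=float)
    return float(np.mean(np.abs(real - pred)))


def nmae_mean(real, pred):
    """NMAE1: MAE normalised by the mean real load."""
    return mae(real, pred) / float(np.mean(real))


def nmae_range(real, pred):
    """NMAE2: MAE normalised by (maximum load - minimum load)."""
    real = np.asarray(real, dtype=float)
    return mae(real, pred) / float(real.max() - real.min())


real = [0.0, 10.0, 20.0, 10.0]
pred = [2.0, 8.0, 20.0, 14.0]
print(mae(real, pred), nmae_mean(real, pred), nmae_range(real, pred))
```

Unlike the MAPE, neither variant divides by the load at an individual ts, so time steps with zero load do not cause a division by zero.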
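The peak-related metrics can be sketched in the same way. Since Equations (12) to (14) are not reproduced in this text, the sketch assumes absolute deviations averaged over the D predicted days, with the time deviation obtained from the slot difference multiplied by 15 min. The function name is hypothetical.

```python
import numpy as np


def peak_metrics(real, pred, slots_per_day=96, slot_min=15):
    """Return (pdev in kW, peak MAPE in %, tdev in min) over whole days."""
    real = np.asarray(real, dtype=float).reshape(-1, slots_per_day)
    pred = np.asarray(pred, dtype=float).reshape(-1, slots_per_day)
    real_peak, pred_peak = real.max(axis=1), pred.max(axis=1)
    abs_dev = np.abs(real_peak - pred_peak)
    pdev = float(abs_dev.mean())
    # Division by the real daily peak; a day with a zero real peak would
    # need separate handling.
    mape = float(100.0 * np.mean(abs_dev / real_peak))
    slot_dev = np.abs(real.argmax(axis=1) - pred.argmax(axis=1))
    tdev = float(slot_min * slot_dev.mean())
    return pdev, mape, tdev
```

If the forecast for a day is constant, for instance a flat 0 kW prediction, the predicted peak slot is not meaningful and such days would have to be excluded from the time deviation.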

| RESULTS
Having covered the methodology of this work, this section addresses the training, tuning, and prediction results.

| Initial training results
Table 4 shows the selected approach for each charging site and forecast horizon based on the minimum validation loss obtained during initial training. In all cases, the minimum loss is recorded while training the LSTM with 128 units. The small epoch number for most of the charging sites indicates that the LSTM suffers from early overfitting. Moreover, the different modes, encoding techniques, and input ts only yield small differences in the validation loss. However, in all cases the lowest MSE is obtained while using the stateless mode. The sine/cosine encoding shows a slightly superior performance in most cases. Finally, using the same number of input ts as output ts yields the lowest MSE for the 96-ts forecast. For the 4-ts prediction, 16 input ts result in the lowest validation loss for SCc, REc, and WOc, whereas four input ts achieve the best fit for PUc.

| Hyperparameter tuning results
The maximum and minimum validation losses obtained during tuning and the best hp combination chosen are given in Table 5. The maximum and minimum validation loss differ considerably. The highest MSE values are obtained for the 4-ts prediction while using a high dropout or learning rate. For the 96-ts forecast, a small number of units leads to the highest validation losses. When comparing the minimum loss between random search and initial training, it is evident that the hp tuning only yields improvements for half of the forecasts. The lowest losses are obtained with a high number of units (64 or 128). In most cases, an LSTM with only one hidden layer seems sufficient. Only for the REc 96-ts forecast and the WOc 4-ts prediction is the minimum loss recorded while using two hidden layers. All batch sizes are applied for the different forecasts. Dropout generally does not have a positive effect on training, except for the 96-ts REc forecast. Lastly, in most of the cases, the learning rate of 0.001 yields the lowest loss.

| Forecasting results
In the following sections, the forecasting results are analysed both graphically and numerically.

| Graphical results
Figure 6 illustrates the forecasting results for the 36 days of test data for each category of charging site for both the 4-ts (green) and 96-ts (purple) prediction. When comparing the outcomes with the real charging load (grey), it becomes apparent that the 4-ts prediction achieves superior results compared to the 96-ts forecast. The 4-ts forecast yields a relatively precise picture of the real load curve in grey, with no major outliers to be seen.
Considering the 96-ts prediction results, three shortcomings can be identified. First, the LSTM is not capable of predicting the charging load on public holidays. The projected load almost exclusively exceeds the real load by a substantial margin and exhibits a course similar to that of non-holidays. Second, the LSTM is not able to accurately predict the level of the real peak load for most days, with the real peak load frequently exceeding the predicted peak load substantially. Finally, the LSTM struggles to anticipate the impact of holiday periods. This is particularly evident for REc, where the forecast significantly exceeds the reduced real charging load during the winter holiday period.

| Numerical results
Table 6 summarises the numerical results. As previously evidenced during the graphical analysis, reducing the forecast horizon from 96 ts to 4 ts considerably reduces the prediction error. The MAE for REc, PUc, and WOc can be more than halved in most of the cases. For SCc, the MAE is reduced in all cases by more than 2 kW. The lowest overall MAE for both forecast horizons is recorded for REc. WOc generates the second lowest MAE values, followed by PUc. The highest MAE values are seen for SCc.
A different picture emerges when looking at the NMAE1 and NMAE2 results. For the 4-ts prediction, the lowest NMAE1 is obtained for PUc, followed by REc, WOc, and SCc. Conversely, using the NMAE2, the most precise results are seen for WOc, followed by PUc, REc, and SCc. For the 96-ts forecast, the lowest NMAE1 is again obtained for PUc and REc; however, the least accurate prognosis is given for WOc this time. In contrast, the lowest NMAE2 is seen for WOc, followed by SCc and PUc. REc ranks last.
When comparing the clustered NMAE1 results, it can be seen that the 4-ts forecast yields the most accurate forecast for PUc and REc. SCc and WOc score the poorest results. Conversely, when using the NMAE2 as the evaluation criterion, varying results are visible depending on the type of day. For weekdays, the most accurate predictions are seen for WOc, followed by PUc, REc, and SCc. While REc yields the most exact forecast for weekends, WOc, PUc, and SCc take the second, third, and fourth place. For public holidays, REc ranks first again, followed by SCc, PUc, and WOc. Similar findings to the NMAE1 results of the 4-ts forecast can be seen for the NMAE1 results of the 96-ts forecast. REc and PUc achieve the most precise scores in most of the cases. However, for public holidays, SCc outperforms PUc and comes in second place, behind REc. Looking at the NMAE2 results, the poorest performance for weekdays can be observed for REc, whereas SCc yields the most accurate results. On weekends, SCc ranks first again, and WOc ranks last. For public holidays, WOc performs particularly poorly again. REc obtains the most favourable score.

FIGURE 6 Forecasting results for the 36 days of test data illustrated for all categories of charging sites.
UNTERLUGGAUER ET AL.
Table 7 quantifies the LSTM's difficulties in predicting peak loads with a forecast horizon of 96 ts in numerical terms. The pdev, the MAPE, and the tdev are given for each charging site. For all forecasts, the average deviation between true and predicted peak load amounts to 29.16 kW for SCc, 6.37 kW for REc, 14.17 kW for PUc, and 17.85 kW for WOc. The MAPE amounts to 53.19% for SCc, 34.38% for REc, 37.16% for PUc, and 128.97% for WOc. The average deviation between the real and predicted time of the peak load is 153, 88, 104, and 285 min for SCc, REc, PUc, and WOc, respectively.
Examining the scores grouped by weekdays, weekends, and public holidays, several discrepancies between the charging sites can be highlighted. For SCc, the absolute and relative errors on weekends and public holidays are lower than on weekdays. In addition, the time deviation between true and predicted occurrence of the peak load is significantly lowered as well. For REc, in contrast, the highest relative and absolute errors are seen for public holidays, although the time deviation is minimised. For PUc and WOc, according to the MAPE, the most precise results are achieved during the week. For WOc predictions on weekends, the absolute error amounts to 4.73 kW and the MAPE to 100% due to a constant prediction of 0 kW. Therefore, no deviation between the time of the true and predicted peak load can be calculated. The highest error scores between the real and predicted peak load are recorded on public holidays for both charging sites, and the time deviation between true and predicted peak load amounts to a multiple of the deviation during the week.

| DISCUSSION
In this section, the initial training, hp tuning, and forecasting results are discussed in more detail.

| Initial training results
Four key findings can be summarised. To begin with, the inclusion of multivariate inputs shows a positive influence on the training results in terms of validation loss. However, a detailed analysis of the impact on the training and forecasting results was not part of this work. Since the training data used in this work spans less than one year for each category of charging site, multivariate inputs are included to help the LSTM identify patterns in the data. With multi-year data available, the importance of including multivariate inputs may decrease or become obsolete.
Next, the early overfitting indicates that the complexity of the model is too high relative to the size of the available data and that, due to changing load characteristics throughout the year, the validation data is not fully representative of the entire dataset. Access to charging data measured over a longer period is, thus, of great interest to enhance the performance of the model in the future.
Furthermore, the different encoding techniques and input ts variants only cause negligible differences in the validation loss. Thus, the extra time needed to implement and train the LSTM with various variants cannot be justified. Hence, selecting the same number of input ts and the same encoding for all sites seems to be more rational. Sine/cosine encoding offers a viable encoding variant, and 16 input ts for the 4-ts and 96 input ts for the 96-ts prediction seem to be appropriate.
Lastly, the stateful mode does not provide benefits compared to the stateless mode under the given conditions. However, when multi-year data becomes available in the future, the stateful mode may reveal its strengths, as seasonal relationships can be identified by the LSTM. Consequently, with the availability of data over several years, a further comparison of the different modes is advisable.
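
As an illustration of the sine/cosine encoding mentioned above, the following minimal sketch maps a cyclic time feature onto the unit circle. The helper name `encode_cyclic` and the choice of the 15-min step of the day as the encoded feature are assumptions; the study may encode other calendar features as well.

```python
import math

STEPS_PER_DAY = 96  # 24 h at 15-min resolution


def encode_cyclic(value, period):
    """Map a cyclic feature (e.g. step of the day) to a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


midnight = encode_cyclic(0, STEPS_PER_DAY)
last_step = encode_cyclic(95, STEPS_PER_DAY)  # 23:45
noon = encode_cyclic(48, STEPS_PER_DAY)

# 23:45 ends up close to 00:00, whereas noon lies on the opposite side of the circle.
print(distance(midnight, last_step) < distance(midnight, noon))
```

Unlike one-hot encoding, which needs one input column per category, this representation uses two columns per feature and preserves the adjacency of the last and first step of a cycle.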

| Hyperparameter tuning results
Three relevant conclusions can be drawn. First, the significant variation between the minimum and maximum loss during tuning reveals that the different hp combinations exert a major impact on the generalisability of the LSTM and, ultimately, on its predictive power. This underlines the importance of hp tuning for selecting the most suitable LSTM model.
Moreover, the varying combinations of hp exert a fundamentally different impact on the validation loss for each charging site category and forecast horizon. Thus, it is essential to perform the hp tuning separately for each use case.
Finally, random search shows only minor or no enhancements compared with the initial training. This might be due to the limited number of evaluations and the limited search space. A higher number of executions, the inclusion of other hp in the search space, or the use of a more sophisticated method, such as Bayesian hp tuning, could lead to further improvements.
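
The principle of random search can be sketched as follows. The search space, the function names, and the `toy_objective` are hypothetical placeholders: in practice the objective would train the LSTM with the sampled hp and return its validation loss, and the search space used in the study is not reproduced here.

```python
import random

# Hypothetical search space, for illustration only.
SEARCH_SPACE = {
    "units": [32, 64, 128],
    "dropout": [0.0, 0.1, 0.2],
    "learning_rate": [1e-2, 1e-3, 1e-4],
}


def random_search(objective, space, n_trials, seed=0):
    """Sample n_trials random hp combinations and keep the one with the lowest loss."""
    rng = random.Random(seed)
    best_config, best_loss = None, float("inf")
    for _ in range(n_trials):
        config = {name: rng.choice(values) for name, values in space.items()}
        loss = objective(config)
        if loss < best_loss:
            best_config, best_loss = config, loss
    return best_config, best_loss


def toy_objective(config):
    """Stand-in for 'train the model and return the validation loss'."""
    return (abs(config["units"] - 64) / 64
            + config["dropout"]
            + abs(config["learning_rate"] - 1e-3))


best_config, best_loss = random_search(toy_objective, SEARCH_SPACE, n_trials=20)
print(best_config, best_loss)
```

Because each trial is drawn independently, a small number of trials may miss good regions of the search space, which is consistent with the limited gains reported above.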

| Forecasting results
The results of the forecast are discussed below, followed by implications for the practical application of the forecast.

| Differences between the different categories of charging sites
The comparison of the forecasting results of the different charging sites revealed that the evaluation is highly dependent on the metric. However, SCc almost exclusively yields the poorest results for the 4-ts prediction, indicating that the fluctuating load hinders the predictive power of the LSTM.
Further evident discrepancies involve the varying results for weekdays, weekends, and public holidays. While the lowest NMAE1 is obtained for SCc and REc on public holidays, the public holiday load predictions for PUc and WOc are significantly less accurate than the predictions on weekdays and weekends. This finding is attributable to the distinct load profile characteristics rather than to a different forecasting ability of the LSTM. While the load for PUc and WOc shifts radically on public holidays, the change for REc is less pronounced. With SCc, the load profile also varies noticeably, but the entire load series shows strongly fluctuating load characteristics. Thus, the impact of public holidays remains small. Similar findings can be seen when looking at the peak deviation and MAPE results. While for SCc and REc, the MAPE difference between weekdays and public holidays amounts to only 3.75 and 5.09 percentage points, respectively, the figure for public and workplace charging increases by 97.96 and 906.63 percentage points, respectively. Once again, the discrepancies can be explained by the impact of public holidays on the load pattern.
Analysing the weekday results, the significant discrepancies in the MAPE and time deviation likewise indicate that the characteristics of the load profile decisively impact the accuracy of the forecast. The highly irregular load profile for SCc results in the poorest accuracy for both the predicted peak load level and the time of the peak load. WOc, on the other hand, exhibits the steadiest pattern concerning the time of the peak load, which is why the average time deviation is the lowest at 35 min.
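
A grouped evaluation of this kind can be sketched as below. The holiday set is an illustrative subset of Finnish public holidays, the rule that a holiday takes precedence over a weekend is an assumption, and the function names and sample records are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Illustrative subset only: Independence Day and Christmas Day 2019.
HOLIDAYS = {date(2019, 12, 6), date(2019, 12, 25)}


def day_type(day):
    """Classify a date as 'public holiday', 'weekend', or 'weekday'."""
    if day in HOLIDAYS:
        return "public holiday"
    if day.weekday() >= 5:  # Saturday = 5, Sunday = 6
        return "weekend"
    return "weekday"


def mae_by_day_type(records):
    """records: iterable of (date, true_kw, predicted_kw) tuples."""
    errors = defaultdict(list)
    for day, true_kw, pred_kw in records:
        errors[day_type(day)].append(abs(true_kw - pred_kw))
    return {group: sum(vals) / len(vals) for group, vals in errors.items()}


# Made-up sample values, not data from the study.
sample = [
    (date(2019, 12, 5), 10.0, 9.0),
    (date(2019, 12, 6), 2.0, 8.0),
    (date(2019, 12, 7), 3.0, 2.5),
]
print(mae_by_day_type(sample))
```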

| Overall findings
The superior performance of the 4-ts prediction compared to the 96-ts forecast is attributable to two factors. Due to the shorter forecast horizon, the preceding load values are shown to the LSTM more frequently and assist the LSTM in mapping the height and course of the load more precisely. Moreover, the poor prediction results for the 96-ts forecast might be traced back to the strongly altered load profile of the test data, caused by the increase of charging load throughout the year and the winter holiday period. Therefore, the training and validation data might not have been fully representative of the test data, limiting the LSTM's predictive power.
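
The first factor can be illustrated with a minimal windowing sketch. The function `make_windows`, the univariate series, and the choice of a stride equal to the forecast horizon are assumptions made for illustration; the study uses multivariate inputs, and its exact windowing scheme is not described in this section.

```python
def make_windows(series, n_in, n_out, stride):
    """Split a series into (input, target) pairs.

    Each pair holds n_in past values and the n_out values that follow them.
    """
    samples = []
    last_start = len(series) - n_in - n_out
    for start in range(0, last_start + 1, stride):
        x = series[start:start + n_in]
        y = series[start + n_in:start + n_in + n_out]
        samples.append((x, y))
    return samples


three_days = list(range(3 * 96))  # placeholder for 3 days of 15-min load values

short = make_windows(three_days, n_in=16, n_out=4, stride=4)
long = make_windows(three_days, n_in=96, n_out=96, stride=96)

# With the short horizon, fresh measurements enter the input every hour;
# with the long horizon, only once per day.
print(len(short), len(long))
```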

| Implications for aggregators and network operators
There are two courses of action by which aggregators and network operators can implement the LSTM to achieve optimised forecasts for their respective use cases. First, it is beneficial to limit the load forecast to weekdays. Weekends and public holidays are accompanied by changes in user behaviour that are difficult to predict and, due to the reduced charging load, exhibit a much lower aggregation potential and risk of bottlenecks. By focussing the forecast on weekdays, a higher prediction accuracy can be obtained.
Second, the forecast should only be carried out for the periods of a day with the highest charging load, where the aggregation potential is the highest and bottlenecks in the distribution network are most likely. By shortening the forecast period, the accuracy is improved, as shown in this work. The forecast will be further enhanced by focussing the LSTM on a specific period of the day.

| CONCLUSIONS AND FUTURE WORK
This study proposes a novel multivariate long short-term memory approach for multi-step EV charging load forecasting with two different prediction horizons. The performance of the forecasting approach is evaluated and compared between four different categories of charging sites. The results show that the distinct characteristics of the different charging sites influence the predictability of the charging load, but that the evaluation is highly dependent on the chosen metric. It is also demonstrated that the accuracy diminishes with an increasing forecast horizon. When the forecast horizon is reduced from 96 to four time steps, the MAE is more than halved in most cases and amounts to 4.35 kW for shopping centre, 1.53 kW for residential, 2.7 kW for public, and 1.85 kW for workplace charging. In general, the forecasting accuracy tends to be the best at residential and workplace charging sites. The weakest accuracy is found at shopping centres.
The findings of this work mainly benefit two stakeholders. Aggregators have great interest in a reliable load forecast to sell EV flexibilities to the energy market. Network operators, in contrast, are keen on forecasting the charging load to identify possible bottlenecks in the distribution network. To apply the proposed LSTM model most effectively for both use cases, future studies should consider two aspects.
To address the outlined weaknesses of the 96-time step forecast, the LSTM will be trained on data collected over a longer time period to ensure that the training and validation data is truly representative of the test data.With data available over several years, the LSTM will be extended by further indicators, such as the month, the stateful mode will be investigated again, and time series cross validation will be performed to increase the robustness of the model and overcome early overfitting.
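
One common form of time series cross validation is the expanding-window split, sketched below. The function name and parameters are hypothetical, and the scheme is shown only as one possible realisation of the planned validation, not as the procedure the authors will use.

```python
def expanding_window_splits(n_samples, n_splits, val_size):
    """Return (train_indices, val_indices) pairs in chronological order.

    Every validation block lies strictly after its training block, and the
    training block grows with each split.
    """
    first_val_start = n_samples - n_splits * val_size
    if first_val_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    splits = []
    for k in range(n_splits):
        val_start = first_val_start + k * val_size
        val_end = val_start + val_size
        splits.append((range(0, val_start), range(val_start, val_end)))
    return splits


for train_idx, val_idx in expanding_window_splits(n_samples=10, n_splits=3, val_size=2):
    print(list(train_idx), list(val_idx))
```

Since no validation block precedes its training data, the model is never evaluated on values older than those it was trained on, which matters when the load level drifts over the year.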
Furthermore, a future research objective involves the development of a customised LSTM, tailored to each charging site. The forecast will be targeted to weekdays, since user behaviour on weekends and holidays is often difficult to predict and, in many cases, as with workplace charging, the aggregation potential is insufficient. Additionally, the prediction will be narrowed down to the most suitable time periods of the day at each charging site, that is, when a sufficient volume of EVs is available for pooling the EV flexibility and when bottlenecks caused by simultaneous charging are most likely, such as 4:00 PM to 8:00 PM for residential charging, 8:00 AM to 12:00 PM for public charging, or 6:00 AM to 10:00 AM for workplace charging.
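
Restricting the forecast scope in this way could be expressed as a simple timestamp filter. The dictionary and function names are hypothetical; the time windows are the examples given above, and public holidays are ignored in this sketch for brevity.

```python
from datetime import datetime

# Example windows from the text, as (start hour, end hour) in local time.
PEAK_WINDOWS = {
    "residential": (16, 20),
    "public": (8, 12),
    "workplace": (6, 10),
}


def in_forecast_scope(timestamp, category):
    """True if the timestamp falls on a weekday inside the site's peak window."""
    start_hour, end_hour = PEAK_WINDOWS[category]
    is_weekday = timestamp.weekday() < 5  # Monday = 0 ... Friday = 4
    return is_weekday and start_hour <= timestamp.hour < end_hour


print(in_forecast_scope(datetime(2019, 12, 9, 17, 15), "residential"))  # Monday afternoon
print(in_forecast_scope(datetime(2019, 12, 7, 17, 15), "residential"))  # Saturday afternoon
```

At 15-min resolution, each of these four-hour windows covers 16 time steps per day.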
and $W_{c,h}$ label weight matrices, and $b_f$, $b_i$, $b_o$, and $b_c$ are bias vectors. Operations $+$ and $\odot$ denote the element-wise addition and multiplication, and $\phi$ depicts the tanh activation function. The forget gate determines which information from the previous cell state, $c_{t-1}$, is preserved. The cell state represents the long-term memory of the LSTM. The input gate decides which information from the candidate cell state, $\tilde{c}_t$, is to be used to update the previous cell state. The output gate controls which part of the new cell state, $c_t$, to output and pass as the hidden state, $h_t$, the short-term memory, to the next LSTM cell. The current input, $x_t$, and the previous hidden state, $h_{t-1}$, are multiplied by their respective weights and, together with the bias vectors, form the inputs for all three gates. The sigmoid function, $\sigma$, introduces non-linear characteristics to the gates and decides which signals pass the gates. While a value of zero causes signals to disappear, a value of one ensures that the signals pass the gate.
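
The gate mechanics described above correspond to the widely used standard LSTM formulation, which can be sketched in NumPy as follows. The function `lstm_step`, the dictionary layout of the weights, and the assumption that the paper's equations match this standard form are illustrative; the original equations are not reproduced here.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One forward step of a standard LSTM cell.

    W maps a gate name ('f', 'i', 'o', 'c') to a pair (W_x, W_h) of weight
    matrices; b maps the same names to bias vectors.
    """
    def pre_activation(gate):
        W_x, W_h = W[gate]
        return W_x @ x_t + W_h @ h_prev + b[gate]

    f_t = sigmoid(pre_activation("f"))      # forget gate
    i_t = sigmoid(pre_activation("i"))      # input gate
    o_t = sigmoid(pre_activation("o"))      # output gate
    c_tilde = np.tanh(pre_activation("c"))  # candidate cell state

    c_t = f_t * c_prev + i_t * c_tilde      # new cell state (long-term memory)
    h_t = o_t * np.tanh(c_t)                # new hidden state (short-term memory)
    return h_t, c_t


# Tiny example with random weights: 3 input features, 2 hidden units.
rng = np.random.default_rng(0)
W = {g: (rng.normal(size=(2, 3)), rng.normal(size=(2, 2))) for g in "fioc"}
b = {g: np.zeros(2) for g in "fioc"}
h_t, c_t = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), W, b)
print(h_t.shape, c_t.shape)
```

A forget-gate value near zero erases the corresponding entry of the previous cell state, while a value near one carries it over unchanged, matching the gating behaviour described above.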

FIGURE 3 Aggregated electric vehicle charging load over the course of one year illustrated for each category of charging site

FIGURE 4 Aggregated charging at each category of charging site exemplified for three different weeks

a All MSE values (validation loss) are given in units of $10^{-3}$.
b SL = Stateless mode.
c S/C = Sine/Cosine encoding, OH = One hot encoding.

TABLE 1 Literature review on studies addressing the issue of short-term EV charging load forecasting

Columns: Year and ref.; Proposed model(s); Baseline model(s); Superior model; Forecast approach; Resolution; Data characteristics; Data origin [year]

Abbreviations: ANN, artificial neural network; ARIMA, autoregressive integrated moving average; CNN, convolutional neural network; DBN, deep belief network; DNN, deep neural network; EV, electric vehicle; FARIMA, fractional autoregressive integrated moving average; FC, fuzzy clustering; GRU, gated recurrent unit; LSTM, long short-term memory; MCS, Monte Carlo simulation; MPSF, modified pattern sequence-based forecast; NILA, Lion algorithm by niche immune; R-ANN, rough artificial neural network; RR-ANN, recurrent rough artificial neural network; RF, random forest; SAE, stacked auto-encoder; SARIMA, seasonal autoregressive integrated moving average; SARIMAX, seasonal autoregressive integrated moving average (with exogenous variables); TWDPNN, time weighted dot product based nearest neighbour; WPA, wolf pack algorithm.

TABLE 2 Characteristics of the EV charging data
Abbreviations: EV, electric vehicle.

Summary of the best approach after initial training (column headings alternate between the 4-ts and 96-ts predictions)

TABLE 4 Summary of the peak deviation, MAPE and time deviation results of the 96-ts prediction