Traditionally, power grids have been sized with large safety margins to account for low-probability events. As a result, limits can be imposed on generation even when those events do not occur. Generation and load forecasting, coupled with power flow analysis, will allow E-REDES to estimate the power flowing in each line of the HV and MV grid, and therefore to allow increased generation on the existing infrastructure. The grid's generation hosting capacity could increase by 20%, or even more depending on the particular case.
PREDIS is a Big Data time series forecasting project whose goal is to predict 200k load diagrams, each with a 15-minute granularity, daily for all medium and high-voltage installations of the Portuguese electrical grid.
To tackle this ambitious task, the PREDIS daily inference pipeline relies on an ensemble of three state-of-the-art forecasting models (Elastic Net, LightGBM, and Prophet), together with a Baseline model that outputs the previous day's load data. Each day, and for each time series, the model whose forecast for the previous day had the lowest MAE is chosen to forecast the next three days.
Both the training and inference pipelines are heavily data-dependent. Training used two years of historical data (comprising 14.5B records), while inference uses, every day and for each individual time series, a year's worth of historical data to fit the series and forecast it. This compute-intensive workload runs exclusively on Databricks and Spark.
Architecture and Tech Stack
Here's a summary of our tech stack:
Data Sources
The main data source provides meter data from all high and medium voltage installations across the Portuguese electrical grid. Every day, PREDIS ingests roughly 60 million new records: the input dataset encompasses 96 measurements per asset for 100,000 installations and six types of energy consumption (96 × 100,000 × 6 = 57.6 million), comprising the core of our load forecasting efforts.
Another data source provides essential registry and geographic information for all electrical grid assets and installations.
The IPMA (Portuguese Institute for Sea and Atmosphere) source supplies weather forecasts up to three days in advance, which are incorporated as exogenous variables in our models. These weather forecasts are pivotal for creating external factors that influence energy demand, such as temperature fluctuations and precipitation, thereby enhancing the accuracy of our predictions.
ETL Workflow
Our ETL (Extract, Transform, Load) workflow comprises several Databricks notebooks, each playing a specific role in transforming the raw data. Below is an overview of the data transformations that form the backbone of PREDIS.
All data sources are first imported into the bronze database (notebook import_data). Each source is processed individually before all sources are combined into a single master data table, which is persisted in the silver database.
To incorporate weather data into our forecasts, we perform a nearest neighbor join (notebook nnjoin_sweg_ipma) to determine the closest weather forecast grid point for each installation. This step ensures that weather data is accurately aligned with the specific locations of our assets.
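The nearest neighbor join can be illustrated with a minimal sketch. It is not the nnjoin_sweg_ipma notebook itself: the function names, the haversine distance, and the coordinates below are assumptions chosen for illustration, and the brute-force distance matrix only suits small inputs.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; inputs in degrees, NumPy broadcasting applies."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def nearest_grid_point(inst_lat, inst_lon, grid_lat, grid_lon):
    """Return, for each installation, the index of the closest grid point."""
    inst_lat = np.asarray(inst_lat, dtype=float)
    inst_lon = np.asarray(inst_lon, dtype=float)
    grid_lat = np.asarray(grid_lat, dtype=float)
    grid_lon = np.asarray(grid_lon, dtype=float)
    # Distance matrix of shape (n_installations, n_grid_points).
    dist = haversine_km(
        inst_lat[:, None], inst_lon[:, None], grid_lat[None, :], grid_lon[None, :]
    )
    return dist.argmin(axis=1)


# Hypothetical coordinates: two installations and three forecast grid points.
installations = {"lat": [38.72, 41.15], "lon": [-9.14, -8.61]}
grid = {"lat": [38.70, 41.20, 37.00], "lon": [-9.10, -8.60, -8.00]}

nearest = nearest_grid_point(
    installations["lat"], installations["lon"], grid["lat"], grid["lon"]
)
```

Because installation and grid coordinates are static, a join like this only needs to be recomputed when installations or the forecast grid change. At the scale of 100,000 installations, a spatial index would likely be preferable to the full distance matrix.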
A master data table is created to maintain a comprehensive record of all installation keys and static attributes. This table serves as a reference for linking dynamic data with static installation information.
The raw meter data, which is reported in six separate channels, is aggregated into two main channels: active and reactive energy. This aggregation simplifies the dataset and focuses the forecasting models on the most relevant data to the business.
Missing data points due to communication failures or other issues are addressed by reindexing the timeseries (notebook timeseries_reindexed). This process involves adding missing timesteps according to a fixed start and end date, ensuring continuity in the time series data.
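The reindexing step can be sketched with pandas as follows. This is an illustration rather than the timeseries_reindexed notebook: the function name and sample data are hypothetical, and the column names timestamp_utc and value are borrowed from the training snippet later in this post.

```python
import pandas as pd


def reindex_timeseries(df, start, end, freq="15min"):
    """Add missing timesteps between start and end; new rows get NaN values."""
    full_index = pd.date_range(start=start, end=end, freq=freq, name="timestamp_utc")
    return (
        df.set_index("timestamp_utc")
        .sort_index()
        .reindex(full_index)
        .reset_index()
    )


# Hypothetical series with the 00:15 and 00:30 readings missing.
raw = pd.DataFrame(
    {
        "timestamp_utc": pd.to_datetime(
            ["2024-01-01 00:00", "2024-01-01 00:45", "2024-01-01 01:00"]
        ),
        "value": [10.0, 12.0, 11.0],
    }
)
reindexed = reindex_timeseries(raw, "2024-01-01 00:00", "2024-01-01 01:00")
```

The added rows carry NaN values, so the gaps stay visible to downstream steps. Note that reindex raises an error if the original series contains duplicate timestamps, so those would need to be resolved first.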
Finally, we compile the inference dataset by joining all processed data sources, including weather forecast parameters required for accurate predictions (notebooks forecast_horizon_index and base_inference_dataset). This dataset forms the basis for generating forecasts and is used to train and validate the forecasting models.
Scalability
The ETL and pre-processing pipeline is scalable by design. The process manages huge amounts of data by leveraging Databricks and Apache Spark with PandasUDFs for distributed data processing.
Time Series Forecasting
For each load diagram, we fit three local time series models every day: Elastic Net, LightGBM, and Prophet.
In all three models, weather forecast variables and holidays are incorporated as exogenous variables to enhance the accuracy of the predictions. Combining different model families lets the ensemble capture a wide range of patterns and trends in the time series data, leading to more accurate and robust forecasts.
Cross-validation
A sliding window approach was employed for cross-validation. The two-year training dataset was evenly split into seven segments, each based on a 60-day interval. The models were trained using the training set to predict the validation set up to 5 days ahead, and error metrics were estimated. The average of all metric scores across these folds provided the final validation score, which was used to determine the optimal hyperparameters.
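The exact fold layout is not spelled out above, so the following is only one possible reading: seven validation windows of five days each, with cutoffs spaced 60 days apart and counted back from the end of the training data. The function name, its parameters, and the end date are hypothetical. The start_date and end_date columns mirror the df_folds argument described in the training snippet later in this post.

```python
import pandas as pd


def make_folds(train_end, n_folds=7, step_days=60, horizon_days=5):
    """Build validation windows whose cutoffs are spaced step_days apart.

    Each fold trains on data before start_date and validates on the
    horizon_days window from start_date to end_date.
    """
    train_end = pd.Timestamp(train_end)
    folds = []
    for i in range(n_folds):
        end_date = train_end - pd.Timedelta(days=i * step_days)
        start_date = end_date - pd.Timedelta(days=horizon_days)
        folds.append(
            {"fold": n_folds - i, "start_date": start_date, "end_date": end_date}
        )
    return pd.DataFrame(folds).sort_values("fold").reset_index(drop=True)


# Hypothetical end of the training period.
df_folds = make_folds("2024-01-01")
```

The final validation score for a set of hyperparameters would then be the mean of the per-fold error metrics.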
Given the impracticality of an exhaustive hyperparameter tuning strategy for this big data problem, we adopted a Bayesian Optimization approach implemented through the Optuna library, instead of a brute-force method like grid search. Optuna leverages past experiments to suggest new hyperparameters, aiming to find the optimal solution by minimizing an error metric.
The training process was optimized by conducting 20 Optuna tuning experiments for both the Prophet and LightGBM models. Because Elastic Net Regression fits faster, it was trained with 28 tuning experiments per time series.
The metrics used include:
Considering the large-scale time series inference dataset, we used PandasUDFs to parallelize the training across the entire universe of time series. This process took approximately six days per model for the 200K time series. A key optimization tip from Databricks was to turn off Adaptive Query Execution (AQE) and manually repartition by the distinct number of time series.
Inference
Based on the optimized hyperparameters for each time series model, the inference pipeline includes the ETL processes described previously, followed by model inference. Given the substantial volume of data, we once again used PandasUDFs to evenly distribute the inference workload. The pipeline incorporates three machine learning models along with a baseline model that predicts the previous day's values. The final model chosen for inference each day is the one that performed best on the previous day, as exemplified below.
While the Linear Regression model performed better in this specific instance, all models are generally used for inference. As anticipated, the inference results were slightly less accurate compared to the training phase. This discrepancy arises because the models were trained with data from 2022 and 2023, and the distribution of each time series may have changed since then. To maintain model accuracy, we plan to retrain the models annually to update the optimized hyperparameters.
The inference process takes approximately three hours to execute using a dedicated cluster with 40 nodes.
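The daily model selection described in this section can be sketched with pandas. The table layout, model labels, and MAE values below are hypothetical. Only the selection rule (lowest previous-day MAE per time series) comes from the text.

```python
import pandas as pd

# Hypothetical previous-day MAE per time series (key) and model.
mae = pd.DataFrame(
    {
        "key": ["A", "A", "A", "A", "B", "B", "B", "B"],
        "model": ["elastic_net", "lightgbm", "prophet", "baseline"] * 2,
        "mae": [1.2, 0.9, 1.5, 1.1, 0.4, 0.6, 0.5, 0.3],
    }
)

# For each key, keep the row with the lowest MAE.
best = mae.loc[mae.groupby("key")["mae"].idxmin(), ["key", "model"]].reset_index(
    drop=True
)
```

Here series A would be forecast with LightGBM and series B with the baseline, so each time series can switch models from one day to the next.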
Test dataset
A daily metrics pipeline runs after the inference pipeline, in a separate Databricks Workflow with lighter compute infrastructure. Its objective is to calculate daily error metrics for all time series and all models by comparing the actual measurements with the past forecasted values. Only MAE (Mean Absolute Error) and MSE (Mean Squared Error) are recorded, for up to two years of history. These values are later used to report the official PREDIS test metrics by normalizing each metric by the installation's nominal power and creating different aggregations, for example “Normalized RMSE by Model, Installation and Channel”.
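A minimal sketch of these metrics for one time series and one day follows. The function name and sample values are hypothetical, and dividing by nominal power is an assumed form of the normalization, which is not specified further here.

```python
import numpy as np


def daily_metrics(actual, forecast, nominal_power):
    """Compute MAE and MSE for one day, plus versions normalized by nominal power."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    error = actual - forecast
    mae = float(np.mean(np.abs(error)))
    mse = float(np.mean(error**2))
    return {
        "mae": mae,
        "mse": mse,
        "normalized_mae": mae / nominal_power,
        # RMSE is derived from MSE first, then normalized.
        "normalized_rmse": float(np.sqrt(mse)) / nominal_power,
    }


# Hypothetical measurements, forecasts, and nominal power in the same unit.
metrics = daily_metrics(
    actual=[100, 110, 90, 100], forecast=[90, 110, 100, 100], nominal_power=200.0
)
```

Storing MSE rather than RMSE may make later aggregation easier: the daily MSE values can be averaged over any period before taking the square root, whereas averaging daily RMSE values would give a different number.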
This metrics history will be used to create a “Model Monitoring” dashboard where business users can inspect the quality of the forecasts both at an individual and aggregated level.
Snippets of code
import json

import optuna
import pandas as pd


def prophet_hyperparameter_optimization(df, df_folds):
    # df: pandas DataFrame with one full time series
    # df_folds: DataFrame with the cross-validation folds to run (start_date and end_date of each fold)
    # exog_columns, n_startup_trials, n_trials and prophet_crossvalidation are assumed to be defined elsewhere in the notebook

    # save key value
    key = df["key"].iloc[0]

    # create timeseries, set time index, sort index
    df = (
        df[["timestamp_utc", "value", *exog_columns]]
        .set_index("timestamp_utc")
        .sort_index()
    )
    df.index.freq = "15min"

    # use Optuna to find the combination of hyperparameters that minimizes the validation metric (MSE)
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(
            n_startup_trials=n_startup_trials, multivariate=True
        ),
    )

    # run the optimization
    study.optimize(
        lambda trial: prophet_crossvalidation(trial, df, df_folds), n_trials=n_trials
    )

    # get best metrics (stored as user attributes on the best trial)
    study_scores = {**study.best_trial.user_attrs}
    study_scores["key"] = key

    # add hyperparameters
    study_scores["hyperparameters"] = json.dumps(study.best_params)

    # add training time of best trial, in whole seconds
    study_scores["duration"] = int(
        (
            study.best_trial.datetime_complete - study.best_trial.datetime_start
        ).total_seconds()
    )

    # convert to pandas
    study_scores = pd.DataFrame([study_scores])
    return study_scores[
        ["key", "hyperparameters", "mean_score_mae", "mean_score_mse", "duration"]
    ]
Figure 1. Example of a Python function that performs Bayesian hyperparameter optimization with Optuna.
# disable Spark AQE, to avoid incompatibility with UDFs
# AQE might override the repartition operation defined below
spark.conf.set("spark.sql.adaptive.enabled", "false")

# repartition by distinct number of timeseries and key columns
# groupby key columns and use applyInPandas to execute the crossvalidation UDF
# the "df_parameters" is a dictionary with all validation sets to run
# this means each core of each worker/executor will run its own task/crossvalidation
key_columns = ["key"]
n_timeseries = df.select(*key_columns).distinct().count()
df = (
    df.repartition(n_timeseries, *key_columns)
    .groupBy(*key_columns)
    .applyInPandas(
        lambda pdf: prophet_hyperparameter_optimization(pdf, df_parameters), schema
    )
)
Figure 2. Example of distributed training using Pandas UDF and manually repartitioning by distinct number of timeseries keys. Adaptive Query Execution (AQE) must be disabled to ensure that the repartition takes effect.
Final thoughts and future work
The PREDIS project represents a significant achievement for E-REDES in time series forecasting for energy demand and supply.
Achievements
Lessons Learned
Future Work
In conclusion, the PREDIS project has successfully demonstrated the application of state-of-the-art forecasting models and advanced data processing techniques. The insights gained pave the way for future innovations in time series forecasting and energy management for E-REDES.
About E-REDES
E-REDES is a Distribution System Operator (DSO) supplying electricity to all connected consumers across Portugal.
Mission: supply electricity to all consumers ensuring quality, security and efficiency, while promoting a sustainable grid development that supports the energy transition and is able to provide, in a neutral way, services to the market agents.