JoseAlfonso
Databricks Employee

Traditionally, power grids have been sized with large safety margins to account for low-probability events, which means limits can be imposed on generation even when those events do not occur. Generation and load forecasting, coupled with power flow analysis, will allow E-REDES to estimate the power flowing in each line of the HV and MV grid, and therefore to allow for increased generation using the existing infrastructure. Depending on the particular case, the grid's generation hosting capacity can potentially be increased by 20% or more.

PREDIS is a Big Data time series forecasting project whose goal is to predict 200k load diagrams, each with a 15-minute granularity, daily for all medium and high-voltage installations of the Portuguese electrical grid.

To tackle this ambitious task, the PREDIS daily inference pipeline relies on an ensemble of three state-of-the-art forecasting models (Elastic Net, LightGBM, and Prophet) together with a Baseline model that outputs the previous day's load data. Each day, for each time series, the individual model that performed best on the previous day's inference (with respect to the MAE metric) is chosen to forecast the next three days.

Both the training and inference pipelines are heavily data-dependent. The former used two years of historical data (comprising 14.5B records), while the latter uses, every day and for each individual time series, a year's worth of historical data to fit the series and produce forecasts. This compute-intensive workload runs entirely on Databricks and Spark.

Architecture and Tech Stack

[Figure: PREDIS architecture and tech stack diagram]

Here's a summary of our tech stack:

  • Cloud Provider: Azure
  • Data Storage: Corporate and project data are stored in a dedicated Data Lake with Delta tables.
  • Development Environment: Fully developed in Databricks Notebooks using PySpark, Python, and Vectorized UDFs.
  • Notebooks Orchestration: Both training and inference pipelines are orchestrated through Databricks Workflows and run on Job Clusters.
  • Resource Orchestration: High-level orchestration through Azure Data Factory, which triggers the Databricks Workflows via the REST API (a minimal sketch of the equivalent call follows this list).
  • Forecast access: Forecasts are written to an Oracle database and accessed by the E-REDES network planning and optimization system.
  • Data Loading: Outputs are written into an Oracle Database through Azure Data Factory.
  • DevOps: CI/CD pipelines managed with Azure DevOps. Model training is performed in the Development environment, while inference runs daily in the Production environment.
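
As a concrete illustration of the resource orchestration item above, the snippet below shows the kind of REST call that triggers a Databricks Workflow run, which is essentially what Azure Data Factory does in our setup. This is only a minimal sketch: the workspace URL, token handling, and job id are illustrative placeholders, not the actual PREDIS configuration.

import requests

# placeholders: replace with the real workspace URL, credential, and Workflow (job) id
DATABRICKS_HOST = "https://<workspace>.azuredatabricks.net"
TOKEN = "<access-token>"
JOB_ID = 123

# trigger an immediate run of the Workflow through the Jobs API
response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": JOB_ID},
)
response.raise_for_status()
print("Triggered run_id:", response.json()["run_id"])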

Data Sources

  • Load diagrams

This main data source provides meter data from all high and medium voltage installations across the Portuguese electrical grid. Every day, PREDIS ingests a staggering 60 million new records: 96 daily measurements per asset, for 100,000 installations and six types of energy consumption, forming the core of our load forecasting efforts.

  • Grid technical information

This data source provides essential registry and geographic information for all electrical grid assets and installations.

  • Weather forecasts

The IPMA (Portuguese Institute for Sea and Atmosphere) source supplies weather forecasts up to three days in advance, which are incorporated as exogenous variables in our models. These forecasts capture external factors that influence energy demand, such as temperature fluctuations and precipitation, thereby enhancing the accuracy of our predictions.

ETL Workflow 

[Figure: PREDIS ETL workflow diagram]

Our ETL (Extract, Transform, Load) pipeline comprises several Databricks notebooks, each playing a specific role in transforming the raw data. Below is an overview of the data transformations that form the backbone of PREDIS.

  • Data Ingestion and Initial Processing

All data sources are first imported into the bronze database (notebook import_data). Each source undergoes individual processing before all sources are combined into a single master data table (this time persisted in the silver database).
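
As a rough sketch of this step (paths, table names, and join keys below are illustrative assumptions rather than the actual PREDIS schema), each raw source is persisted as a Delta table in the bronze database and later combined into the silver master table:

# illustrative sketch only: paths, table names and join keys are placeholders
raw_loads = spark.read.format("parquet").load("/mnt/raw/load_diagrams/")
raw_loads.write.format("delta").mode("append").saveAsTable("bronze.load_diagrams")

# after per-source processing, combine the sources into the silver master data table
master = (
    spark.table("bronze.load_diagrams")
    .join(spark.table("bronze.grid_technical_info"), on="installation_id", how="left")
)
master.write.format("delta").mode("overwrite").saveAsTable("silver.master_data")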

  • Weather Forecast Integration

To incorporate weather data into our forecasts, we perform a nearest neighbor join (notebook nnjoin_sweg_ipma) to determine the closest weather forecast grid point for each installation. This step ensures that weather data is accurately aligned with the specific locations of our assets.
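
One possible way to implement this nearest neighbor join in PySpark is sketched below: cross join the installations with the IPMA grid points, compute the great-circle distance, and keep the closest grid point per installation. The dataframe and column names are assumptions for illustration, and a cross join is only practical here because the weather grid is small.

from pyspark.sql import functions as F, Window

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance in kilometers between two points given in degrees
    dlat = F.radians(lat2 - lat1)
    dlon = F.radians(lon2 - lon1)
    a = (
        F.pow(F.sin(dlat / 2), 2)
        + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.pow(F.sin(dlon / 2), 2)
    )
    return 2 * 6371.0 * F.asin(F.sqrt(a))

pairs = installations.crossJoin(ipma_grid).withColumn(
    "distance_km",
    haversine_km(F.col("inst_lat"), F.col("inst_lon"), F.col("grid_lat"), F.col("grid_lon")),
)

# keep, for each installation, the closest IPMA grid point
w = Window.partitionBy("installation_id").orderBy("distance_km")
nearest = pairs.withColumn("rank", F.row_number().over(w)).filter("rank = 1").drop("rank")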

  • Master Data Table

A master data table is created to maintain a comprehensive record of all installation keys and static attributes. This table serves as a reference for linking dynamic data with static installation information.

  • Data Aggregation

The raw meter data, which is reported in six separate channels, is aggregated into two main channels: active and reactive energy. This aggregation simplifies the dataset and focuses the forecasting models on the data most relevant to the business.
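
A minimal sketch of this aggregation is shown below; the channel codes and their mapping to active and reactive energy are illustrative assumptions, not the actual metering codes.

from pyspark.sql import functions as F

# map the six raw channels into two groups (illustrative channel codes)
channel_group = (
    F.when(F.col("channel").isin("A+", "A-"), "active")
    .when(F.col("channel").isin("Q+", "Q-", "Ri", "Rc"), "reactive")
)

aggregated = (
    meter_data.withColumn("channel_group", channel_group)
    .groupBy("installation_id", "timestamp_utc", "channel_group")
    .agg(F.sum("value").alias("value"))
)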

  • Handling Missing Data

Missing data points due to communication failures or other issues are addressed by reindexing the timeseries (notebook timeseries_reindexed). This process involves adding missing timesteps according to a fixed start and end date, ensuring continuity in the time series data.
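
For a single series (for example, inside a Pandas UDF), the reindexing idea can be sketched as follows; column names are illustrative.

import pandas as pd

def reindex_series(pdf: pd.DataFrame, start: str, end: str) -> pd.DataFrame:
    # pdf: one time series with "timestamp_utc" and "value" columns
    full_index = pd.date_range(start=start, end=end, freq="15min")
    return (
        pdf.set_index("timestamp_utc")
        .reindex(full_index)          # inserts rows with missing values for absent timesteps
        .rename_axis("timestamp_utc")
        .reset_index()
    )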

  • Inference Dataset Creation

Finally, we compile the inference dataset by joining all processed data sources, including weather forecast parameters required for accurate predictions (notebooks forecast_horizon_index and base_inference_dataset). This dataset forms the basis for generating forecasts and is used to train and validate the forecasting models.

Scalability

The ETL and pre-processing pipeline is scalable by design. The process manages huge amounts of data by leveraging Databricks and Apache Spark with PandasUDFs for distributed data processing.

  • Training Pipeline: Processed two years of historical data, resulting in a dataset with 14.5 billion records.
  • Inference Pipeline: Processes one year of historical data every day to fit the models and generate forecasts.

Time Series Forecasting

For each load diagram, we fit three local time series models daily, namely:

  • Elastic Net Regression: This model employs an autoregressive approach, which assumes that the current value of a time series is influenced by its past values. Elastic Net Regression enhances the basic linear regression model by incorporating regularization to manage overfitting and model complexity. Specifically, L2 regularization increases the model's resilience to multicollinearity, while L1 regularization aids in excluding irrelevant features. To extend forecasts beyond t+1, a recursive strategy is employed, where the model's previous forecast is used as an input feature to predict subsequent time steps (a minimal sketch of this recursive strategy is shown after this list).
    • Combines the Best of Both Worlds: Elastic Net merges the strengths of Ridge and Lasso regression techniques. It helps in picking out important features like Lasso and stabilizes the model with regularization like Ridge.
    • Handles Multicollinearity: Elastic Net is great at dealing with multicollinearity, which means it can manage situations where features are highly correlated, thanks to its Ridge regularization.
    • Selects Key Features: If you have lots of predictors and suspect many might be irrelevant, Elastic Net can help select the most important ones thanks to the Lasso regularization.
  • LightGBM: This gradient boosting tree-based model includes built-in feature importance and constructs decision trees sequentially. Each tree is designed to correct the errors of its predecessor, allowing the model to effectively capture seasonality and complex patterns in time series data. Similar to Elastic Net Regression, LightGBM (Light Gradient Boosting Machine) extends forecasts using a recursive strategy.
    • Handles Complex Relationships: LightGBM excels at capturing complex, non-linear patterns in your data that simpler models might miss.
    • Efficient and Fast: It's designed to be quick and efficient with memory, making it ideal for large datasets.
    • High Accuracy: LightGBM often outperforms traditional machine learning models in terms of accuracy due to its ability to capture intricate patterns.
    • Seasonality and Trends: Although not specifically for time series, LightGBM can incorporate time-based features (like month, day, hour) to help capture seasonal and trend patterns.
  • Prophet: Developed by Facebook, Prophet is a forecasting model that identifies seasonal patterns in time series data. Known for its intuitive approach, Prophet typically yields good results with minimal tuning and effort.
    • Built for Time Series: Prophet is specifically designed for forecasting time series data. It handles seasonality, holidays, and trend changes effectively.
    • User-Friendly: It is easy to use and allows for intuitive parameter tuning, making it accessible even to those without deep expertise in time series analysis.
    • Seasonality: Prophet can handle multiple seasonalities (daily, weekly, yearly) and even custom seasonal patterns, leveraging its built-in capabilities for handling various seasonal effects and holidays.
    • Trends and Anomalies: Prophet can detect and model long-term trends and identify anomalies or changepoints in your data.

In all three models, weather forecast variables and holidays are incorporated as exogenous variables to enhance the accuracy of the predictions. This combination allows the ensemble to capture a wide range of patterns and trends in the time series data, leading to more accurate and robust forecasts.
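
To illustrate the recursive strategy used by Elastic Net and LightGBM, here is a minimal, self-contained sketch with scikit-learn's ElasticNet. It uses only lag features and omits the exogenous weather and holiday variables, so it is a simplification of the actual PREDIS feature set. With a 15-minute granularity, forecasting the next three days corresponds to a horizon of 288 steps.

import numpy as np
from sklearn.linear_model import ElasticNet

def recursive_forecast(y_history: np.ndarray, n_lags: int, horizon: int) -> np.ndarray:
    # build a lagged design matrix: predict y[t] from y[t-1], ..., y[t-n_lags]
    X = np.array([y_history[i - n_lags:i] for i in range(n_lags, len(y_history))])
    y = y_history[n_lags:]
    model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

    # recursive inference: feed each prediction back in as a lag for the next step
    window = list(y_history[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        next_value = model.predict(np.array(window[-n_lags:]).reshape(1, -1))[0]
        forecasts.append(next_value)
        window.append(next_value)
    return np.array(forecasts)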

Cross-validation

A sliding window approach was employed for cross-validation. The two-year training dataset was evenly split into seven segments, each based on a 60-day interval. The models were trained using the training set to predict the validation set up to 5 days ahead, and error metrics were estimated. The average of all metric scores across these folds provided the final validation score, which was used to determine the optimal hyperparameters.
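
One way the fold boundaries could be laid out is sketched below; the anchor date and spacing are illustrative, and the actual PREDIS fold definitions are held in a dedicated dataframe passed to the cross-validation UDF (the df_folds argument in the snippet shown later).

import pandas as pd

# seven sliding validation windows, each based on a 60-day interval (illustrative dates)
train_start = pd.Timestamp("2022-01-01")
folds = []
for i in range(7):
    fold_start = train_start + pd.Timedelta(days=60 * i)
    folds.append({"start_date": fold_start, "end_date": fold_start + pd.Timedelta(days=60)})

df_folds = pd.DataFrame(folds)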

Given the impracticality of an exhaustive hyperparameter tuning strategy for this big data problem, we adopted a Bayesian Optimization approach implemented through the Optuna library, instead of a brute-force method like grid search. Optuna leverages past experiments to suggest new hyperparameters, aiming to find the optimal solution by minimizing an error metric. 

The training process was optimized by conducting 20 Optuna tuning experiments for both the Prophet and LightGBM models. Due to the faster fitting time of Elastic Net Regression, it was tuned with 28 experiments per time series.

The metrics used include:

  • Mean Squared Error (MSE): Used for minimizing error during optimization.
  • Normalized Mean Absolute Error (NMAE) and Normalized Mean Squared Error (NMSE): Used for reporting metrics to business stakeholders. The installed power of each installation was used to normalize these metrics (a small sketch follows this list).
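
The sketch below illustrates these normalized metrics; variable names are illustrative, and the exact NMSE normalization convention follows the project's internal definition.

import numpy as np

def nmae(y_true: np.ndarray, y_pred: np.ndarray, installed_power: float) -> float:
    # mean absolute error expressed as a fraction of the installed power
    return np.mean(np.abs(y_true - y_pred)) / installed_power

def nmse(y_true: np.ndarray, y_pred: np.ndarray, installed_power: float) -> float:
    # one common convention: mean squared error divided by the squared installed power
    return np.mean((y_true - y_pred) ** 2) / installed_power**2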

Considering the scale of the time series dataset, we used PandasUDFs to parallelize the training across the entire universe of time series. This process took approximately six days per model for the 200k time series. A key optimization tip from Databricks was to turn off Adaptive Query Execution (AQE) and manually repartition by the distinct number of time series.

[Figure: distributed training of the time series models with Pandas UDFs]

Inference

Based on the optimized hyperparameters for each time series model, the inference pipeline includes the ETL processes described previously, followed by model inference. Given the substantial volume of data, we once again used Pandas UDFs to evenly distribute the inference workload. The pipeline incorporates the three machine learning models along with a Baseline model that predicts the previous day's values. The final model chosen for inference each day is the one that performed best on the previous day, as exemplified below.

[Figure: example of daily model selection based on previous-day performance]

While the Elastic Net (linear regression) model performed best in this specific instance, all models are generally used for inference. As anticipated, the inference results were slightly less accurate than those of the training phase. This discrepancy arises because the models were trained with data from 2022 and 2023, and the distribution of each time series may have changed in the meantime. To maintain model accuracy, we plan to retrain the models annually to update the optimized hyperparameters.
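
To make the daily selection step concrete, the sketch below picks, for each time series, the model with the lowest MAE on the previous day and routes the series to that model's forecasts. Table and column names are illustrative placeholders, not the actual PREDIS objects.

from pyspark.sql import functions as F, Window

# previous_day_metrics columns (illustrative): key, model, mae_previous_day
w = Window.partitionBy("key").orderBy(F.col("mae_previous_day"))

best_model_per_series = (
    previous_day_metrics
    .withColumn("rank", F.row_number().over(w))
    .filter("rank = 1")
    .select("key", "model")
)

# keep only the forecasts produced by the selected model for each series
selected_forecasts = forecasts.join(best_model_per_series, on=["key", "model"], how="inner")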

The inference process takes approximately three hours to execute using a dedicated cluster with 40 nodes.

Test dataset

A daily metrics pipeline runs after the inference pipeline, in a separate Databricks Workflow with lighter compute infrastructure. The objective is to calculate daily error metrics for all time series and all models by comparing the real measurements with the past forecasted values. In this case, only MAE (Mean Absolute Error) and MSE (Mean Squared Error) are recorded, for up to 2 years of history. Later, these values are used to report the official PREDIS test metrics, by normalizing each metric by the installation's nominal power and creating different aggregations, for example “Normalized RMSE by Model, Installation and Channel”.
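
As an illustration of one such aggregation, the sketch below computes a "Normalized RMSE by Model, Installation and Channel"; the dataframe and column names are assumptions for illustration.

from pyspark.sql import functions as F

normalized_rmse = (
    daily_metrics
    .groupBy("model", "installation_id", "channel")
    .agg(F.sqrt(F.avg("squared_error")).alias("rmse"))
    .join(installations.select("installation_id", "nominal_power"), on="installation_id")
    .withColumn("normalized_rmse", F.col("rmse") / F.col("nominal_power"))
)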

This metrics history will be used to create a “Model Monitoring” dashboard where business users can inspect the quality of the forecasts both at an individual and aggregated level.

Snippets of code

 

import json

import optuna
import pandas as pd

# exog_columns, n_startup_trials, n_trials and prophet_crossvalidation are defined elsewhere in the notebook


def prophet_hyperparameter_optimization(df, df_folds):
    # df: Pandas dataframe with a full timeseries
    # df_folds: dataframe with crossvalidation folds to run (start_date and end_date of each fold)

    # save key values
    key = df["key"].iloc[0]

    # create timeseries, set time index, sort index
    df = (
        df[["timestamp_utc", "value", *exog_columns]]
        .set_index("timestamp_utc")
        .sort_index()
    )
    df.index.freq = "15min"

    # use Optuna library to find the combination of hyperparameters that minimizes the validation metric (MSE)
    study = optuna.create_study(
        direction="minimize",
        sampler=optuna.samplers.TPESampler(
            n_startup_trials=n_startup_trials, multivariate=True
        ),
    )

    # run the optimization: each trial fits Prophet on the folds and returns the validation score
    study.optimize(
        lambda trial: prophet_crossvalidation(trial, df, df_folds), n_trials=n_trials
    )

    # get best metrics
    study_scores = {**study.best_trial.user_attrs}
    study_scores["key"] = key

    # add hyperparameters
    study_scores["hyperparameters"] = json.dumps(study.best_params)

    # add training time of best trial
    study_scores["duration"] = (
        study.best_trial.datetime_complete - study.best_trial.datetime_start
    ).seconds

    # convert to pandas
    study_scores = pd.DataFrame([study_scores])

    return study_scores[
        ["key", "hyperparameters", "mean_score_mae", "mean_score_mse", "duration"]
    ]

 

Figure 1. Example of Python function that performs Bayesian Hyperparameter optimization with Optuna.

 

# disable spark AQE config, to avoid incompatibility with UDFs
# AQE might overwrite the repartition operation defined below
spark.conf.set("spark.sql.adaptive.enabled", "false")

# repartition by distinct number of timeseries and key columns
# groupby key columns and use applyInPandas to execute the crossvalidation UDF
# the "df_parameters" is a dictionary with all validation sets to run
# this means each core of each worker/executor will run its own task/crossvalidation
key_columns = ["key"]
df = (
    df.repartition(df.select(*key_columns).distinct().count(), *key_columns)
    .groupBy(*key_columns)
    .applyInPandas(
        lambda df: prophet_hyperparameter_optimization(df, df_parameters), schema
    )
)

 

Figure 2. Example of distributed training using Pandas UDF and manually repartitioning by distinct number of timeseries keys. Adaptive Query Execution (AQE) must be disabled to ensure that the repartition takes effect.

Final thoughts and future work

The PREDIS project represents a significant achievement for E-REDES in time series forecasting for energy demand and supply.

Achievements

  1. Advanced Forecasting Capabilities: Implementing Elastic Net Regression, LightGBM, and Prophet models alongside a Baseline model has enabled accurate daily forecasts for approximately 200k load diagrams, providing precise and actionable predictions for medium and high voltage installations in the Portuguese grid.
  2. Scalable and Efficient Pipeline Design: Our ETL and pre-processing pipeline, developed using Databricks and Apache Spark with PandasUDFs, has effectively managed large-scale data demands. Bayesian Optimization with Optuna has enhanced model performance in big data environments.
  3. Effective Inference Process: A high-performance inference pipeline, run on a 40-node cluster, enables daily forecasts and model selection, demonstrating the capacity to handle large datasets and generate timely predictions.

Lessons Learned

  1. Model Selection and Ensemble Methods: The importance of selecting the most effective model for each time series has been emphasized. Future work will refine ensemble methods and integrate new models to enhance forecast accuracy.
  2. Adaptation to Data Changes: Regular retraining is crucial for maintaining high forecast accuracy, given the observed degradation in model performance over time. Annual updates to the models are planned to adapt to evolving data distributions.
  3. Optimization Strategies: Effective use of PandasUDFs and manual repartitioning has improved the efficiency of cross-validation and inference processes. Future efforts will explore further optimization techniques to streamline these processes.

Future Work

  • Exploration of Global Models: Investigating global models as alternatives to current local models could improve forecasting accuracy.
  • Development of Automated Model Selection Mechanisms: Creating an automated model selection methodology based on historical performance patterns could enhance the efficiency of the inference pipeline.
  • Expansion of Model Ensemble Techniques: Adding new models to the existing ensemble to target different data patterns and forecasting challenges may yield more robust and accurate solutions.
  • Enhancement of Hyperparameter Optimization: Continued refinement of hyperparameter optimization strategies will further improve model performance and forecasting accuracy.
  • MLflow integration with Optuna: Record and track all Optuna trials during future re-trainings with MLflow.

In conclusion, the PREDIS project has successfully demonstrated the application of state-of-the-art forecasting models and advanced data processing techniques. The insights gained pave the way for future innovations in time series forecasting and energy management for E-REDES.

About E-REDES

E-REDES is a Distribution System Operator (DSO) supplying electricity to all connected consumers across Portugal.

Mission: to supply electricity to all consumers while ensuring quality, security, and efficiency, and to promote sustainable grid development that supports the energy transition and provides services to market agents in a neutral way.

  • 1 High/Medium Voltage Concession Granted by the Government
  • 278 Low Voltage Concessions Granted by Municipalities