Hello @MartinK,
Thanks for your question:
When running a continuous process on an interactive cluster in Databricks, here are some suggestions:
Periodic Cluster Restart: It is advisable to restart the cluster periodically to clear accumulated state and prevent memory leaks or other issues that build up in long-running clusters. Note that a restart interrupts any notebooks or jobs attached to the cluster, so schedule it during a low-activity window.
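One way to automate such restarts is the Databricks Clusters REST API (`POST /api/2.0/clusters/restart`). A minimal sketch using only the standard library; the workspace URL, token, and cluster ID below are placeholders you would supply yourself:

```python
import json
import urllib.request

def build_restart_request(host, cluster_id):
    """Build the URL and JSON body for the Clusters API restart call."""
    url = f"{host}/api/2.0/clusters/restart"
    body = json.dumps({"cluster_id": cluster_id})
    return url, body

def restart_cluster(host, token, cluster_id):
    """Send the restart request; needs a personal access token with
    permission to manage the cluster, and network access to the workspace."""
    url, body = build_restart_request(host, cluster_id)
    req = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

# Placeholders -- fill in your own workspace URL, token, and cluster ID:
# restart_cluster("https://<your-workspace>.cloud.databricks.com",
#                 "<personal-access-token>", "<cluster-id>")
```

Scheduling this call from a Databricks job or an external cron keeps the restart cadence predictable.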
Cluster Utilization Monitoring: Continuously monitor the cluster's performance metrics such as CPU, memory usage, and disk I/O. This helps in identifying any performance bottlenecks or resource constraints that might require scaling the cluster or optimizing the workloads.
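Once those metrics are collected (from the cluster metrics UI, Ganglia, or an external agent), alert thresholds can be checked programmatically. A hypothetical helper -- the metric names and threshold values are illustrative assumptions, not Databricks defaults:

```python
# Illustrative alert thresholds, expressed as fractions of capacity (0.0-1.0).
THRESHOLDS = {"cpu": 0.85, "memory": 0.90, "disk_io_wait": 0.30}

def find_bottlenecks(metrics, thresholds=THRESHOLDS):
    """Return only the metrics whose current value exceeds its threshold,
    i.e. the candidates for scaling up or optimizing the workload."""
    return {name: value
            for name, value in metrics.items()
            if value > thresholds.get(name, 1.0)}
```

A non-empty result is the signal to investigate scaling the cluster or tuning the offending workload.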
Cache Management: Periodically clean up the cache to free memory and ensure the cluster does not run out of resources. This can be done with the spark.catalog.clearCache() command, which removes all cached tables and DataFrames from memory, so subsequent reads recompute them from source.
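As a sketch, the clearing can be gated on storage-memory pressure rather than done blindly. The 0.8 threshold below is an illustrative assumption, and the PySpark call is shown commented out so the helper itself stays environment-independent:

```python
def should_clear_cache(used_storage_bytes, max_storage_bytes, threshold=0.8):
    """Decide whether cached data is worth clearing.
    The 0.8 threshold is an illustrative choice, not a Spark default."""
    if max_storage_bytes <= 0:
        return False
    return used_storage_bytes / max_storage_bytes >= threshold

# In a notebook attached to the cluster (requires an active SparkSession):
# if should_clear_cache(used_bytes, max_bytes):
#     spark.catalog.clearCache()  # unpersists all cached tables/DataFrames
```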
Backup Cluster: Having a backup cluster that can take over in case of failures is a good practice. This ensures high availability and minimizes downtime.
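The failover decision itself can be very simple: check the primary cluster's state (e.g. via `GET /api/2.0/clusters/get`) and route work to the backup when it is not running. A sketch of just the selection step, where the state strings follow the Clusters API (RUNNING, TERMINATED, ERROR, etc.):

```python
def pick_cluster(primary_id, backup_id, primary_state):
    """Route work to the backup cluster unless the primary is RUNNING.
    `primary_state` is the `state` field returned by the Clusters API."""
    return primary_id if primary_state == "RUNNING" else backup_id
```

In practice you would poll the primary's state on a schedule and re-point the workload (or its jobs) at whichever cluster ID this returns.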
Cluster Configuration: Ensure that the cluster is configured with the appropriate number of nodes and instance types to handle the workload efficiently. Autoscaling can be enabled to adjust the cluster size based on the workload.
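In the cluster spec, autoscaling is expressed as an `autoscale` block with `min_workers` and `max_workers` instead of a fixed `num_workers`. A sketch of such a spec -- the runtime version, node type, and worker bounds below are placeholders to adjust for your workload:

```python
def autoscaling_cluster_spec(node_type, min_workers, max_workers):
    """Build a Clusters API-style spec with autoscaling enabled.
    Runtime version and node type are placeholders, not recommendations."""
    return {
        "spark_version": "13.3.x-scala2.12",  # example LTS runtime
        "node_type_id": node_type,
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
        },
    }
```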
Job Isolation: For batch jobs, use job clusters instead of the interactive cluster to avoid interference with the continuous processes. This ensures that the batch jobs do not impact the performance of the continuous workloads.
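With the Jobs API, a batch task gets its own job cluster by declaring a `new_cluster` block instead of an `existing_cluster_id`, so it never lands on the interactive cluster. A minimal Jobs API 2.1-style payload; the names, paths, and sizes are placeholders:

```python
def batch_job_spec(job_name, notebook_path, node_type, num_workers):
    """Build a Jobs API-style payload that runs on a dedicated job cluster
    (created for the run, terminated after it) rather than an
    existing interactive cluster."""
    return {
        "name": job_name,
        "tasks": [{
            "task_key": "batch_task",
            "notebook_task": {"notebook_path": notebook_path},
            "new_cluster": {                        # job cluster, per run
                "spark_version": "13.3.x-scala2.12",  # example runtime
                "node_type_id": node_type,
                "num_workers": num_workers,
            },
        }],
    }
```

Because the job cluster exists only for the duration of the run, its resource usage cannot interfere with the continuous workload on the interactive cluster.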
Error Handling and Alerts: Implement robust error handling and set up alerts to notify you of any issues with the cluster or the workloads. This helps in quickly addressing any problems that arise.
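A retry wrapper with an alert hook is one way to sketch this pattern. The `alert` callable here is a stand-in: in practice it could post to email, Slack, or a paging service:

```python
import time

def run_with_alerts(task, alert, max_retries=3, delay_seconds=0):
    """Run `task`, retrying on failure and raising an alert for each error.
    `alert` is any callable taking a message string; the last failure
    re-raises so the surrounding scheduler also sees it."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            alert(f"attempt {attempt}/{max_retries} failed: {exc}")
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds)
```

Wrapping the continuous process's main loop (or each batch step) this way means every failure is both surfaced immediately and retried automatically.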