
Constantly Running Interactive Clusters Best Practices

MartinK
New Contributor

Hello there,

 

I’ve been building an ETL/ELT pipeline with Azure Databricks Workflows, Spark, and Azure Data Lake. It should process changes from an Azure SQL Database in near real time (a Change Data Capture process).

 

For that purpose, I will have several Databricks Workflows (around 7 to 10) that run continuously on one and the same interactive cluster. So, the cluster will be shared between these workflows and will run perpetually. Each of these workflows will process CDC data from one table (these are big tables with SQL Server CDC enabled on them).

 

In addition, I will have several Databricks Workflows that run in batch mode on job clusters and process data from small tables.

 

My question concerns the interactive cluster that will run perpetually in continuous mode: are there best practices for a continuously running process on one and the same interactive cluster? For example, should one schedule periodic downtime so the cluster can be restarted, or perhaps swap in a backup cluster while it is being restarted? Should one periodically clean up the cluster's cache? Or are there other activities that are good to perform periodically?

 

Many thanks for your answer in advance!

 

BR,

Martin

1 ACCEPTED SOLUTION


Alberto_Umana
Databricks Employee

Hello @MartinK,

Thanks for your question:

When running a continuous process on an interactive cluster in Databricks, here are some suggestions:

 

Periodic Cluster Restart: It is advisable to periodically restart the cluster to clear any accumulated state and prevent potential memory leaks or other long-running issues.
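A restart like this can be automated from a small scheduled maintenance job running outside the cluster being restarted, for example by calling the Clusters REST API. The sketch below is a minimal, hypothetical example: the helper name and the host/token placeholders are assumptions, while the `/api/2.0/clusters/restart` endpoint and its `cluster_id` field come from the Databricks REST API.

```python
import json
import urllib.request


def build_restart_request(host: str, cluster_id: str, token: str):
    """Build the HTTP request that restarts a cluster via the Clusters API."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/clusters/restart",
        data=json.dumps({"cluster_id": cluster_id}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (from a scheduled maintenance job, not from the cluster itself):
# req = build_restart_request("https://<workspace-host>", "<cluster-id>", token)
# urllib.request.urlopen(req)
```

Scheduling this as its own lightweight job keeps the restart independent of the workloads on the cluster being recycled.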

Cluster Utilization Monitoring: Continuously monitor the cluster's performance metrics such as CPU, memory usage, and disk I/O. This helps in identifying any performance bottlenecks or resource constraints that might require scaling the cluster or optimizing the workloads.

Cache Management: Periodically clean up the cache to free up memory and ensure that the cluster does not run out of resources. This can be done using the spark.catalog.clearCache() command.
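One way to do this without manual intervention is a small helper in the long-running driver loop that clears the Spark cache on a fixed interval. This is a minimal sketch; the helper name and the six-hour interval are assumptions, while spark.catalog.clearCache() is the standard PySpark call mentioned above (`spark` being the SparkSession that Databricks provides).

```python
import time


def should_clear_cache(last_clear_ts: float, now_ts: float,
                       interval_s: float = 6 * 3600) -> bool:
    """Return True once the cache-clear interval has elapsed."""
    return now_ts - last_clear_ts >= interval_s


# Inside the long-running driver loop (sketch):
# if should_clear_cache(last_clear, time.time()):
#     spark.catalog.clearCache()  # drops all cached tables/DataFrames
#     last_clear = time.time()
```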

Backup Cluster: Having a backup cluster that can take over in case of failures is a good practice. This ensures high availability and minimizes downtime.

Cluster Configuration: Ensure that the cluster is configured with the appropriate number of nodes and instance types to handle the workload efficiently. Autoscaling can be enabled to adjust the cluster size based on the workload.
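As a sketch, an autoscaling cluster definition (in the shape accepted by the Clusters API) might look like the following. The runtime version, node type, and worker counts are placeholder assumptions to adapt to your workload; the `autoscale` block with `min_workers`/`max_workers` is the standard Databricks field that enables autoscaling.

```python
# Hypothetical autoscaling cluster spec; values are placeholders,
# the structure follows the Databricks Clusters API cluster shape.
cluster_spec = {
    "cluster_name": "cdc-continuous",
    "spark_version": "15.4.x-scala2.12",  # pick a current LTS runtime
    "node_type_id": "Standard_DS3_v2",    # Azure example instance type
    "autoscale": {
        "min_workers": 2,  # floor for quiet periods
        "max_workers": 8,  # ceiling for CDC bursts
    },
    "spark_conf": {
        # keep shuffle parallelism in line with the worker ceiling
        "spark.sql.shuffle.partitions": "64",
    },
}
```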

Job Isolation: For batch jobs, use job clusters instead of the interactive cluster to avoid interference with the continuous processes. This ensures that the batch jobs do not impact the performance of the continuous workloads.

Error Handling and Alerts: Implement robust error handling and set up alerts to notify you of any issues with the cluster or the workloads. This helps in quickly addressing any problems that arise.
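As a sketch of that pattern, each task can be wrapped in a retry loop that fires an alert callback once retries are exhausted. The names (`run_with_alerts`, `send_alert`) are hypothetical; wire the alert callback to whatever notification channel you use, such as a webhook or Databricks job notifications.

```python
import time


def run_with_alerts(task, send_alert, max_retries: int = 3,
                    backoff_s: float = 5.0):
    """Run `task`, retrying on failure; alert once retries are exhausted."""
    for attempt in range(1, max_retries + 1):
        try:
            return task()
        except Exception as exc:
            if attempt == max_retries:
                send_alert(f"task failed after {attempt} attempts: {exc}")
                raise
            time.sleep(backoff_s * attempt)  # linear backoff between tries
```

Re-raising after the alert lets the surrounding Workflow still mark the run as failed, so platform-level retry and notification policies stay in effect.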


3 REPLIES


MartinK
New Contributor

Thank you very much for the answer, Alberto! It is very helpful!

Alberto_Umana
Databricks Employee

No problem! Let me know if you have any other questions.
