Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Pipelines with a lot of Spark caching - best practices for cleanup?

Michael_Galli
Contributor II

We have a situation where many concurrent Azure Data Factory notebooks run on a single Databricks interactive cluster (Azure E8-series driver, 1-10 E4-series workers with autoscaling).

Each notebook reads data and calls dataframe.cache(), just to compute row counts before and after a dropDuplicates(), which we log as data-quality metrics (similar to SSIS).

After a few hours, jobs on the cluster start to fail and the cluster needs a reboot. I suspect the caching is the reason.

Is it recommended to call spark.catalog.clearCache() at the end of each notebook (and does this affect other jobs running on the cluster?), or are there better approaches to cleaning up the cluster cache?

ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

If there is no room in memory, this cache spills to disk dynamically, so I don't see the cache itself as the issue. However, the best practice is to call unpersist() in your code after caching. As in the example below, cache/persist is only appropriate when you run multiple actions on the same DataFrame; in other situations you don't need it, as Spark 3.0 with AQE and its optimizations handles this better on its own. Please also check the logs to see why the cluster is actually failing.

# persist() is lazy; the data is materialized in the cache on the first action
df_cached = df.persist()
# caching pays off here because df_cached is reused several times
df1 = df_cached.some_actions_and_transformations()
df2 = df_cached.some_actions_and_transformations()
df3 = df_cached.some_actions_and_transformations()
# release the cached data once it is no longer needed
df_cached.unpersist()

