Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I force a data frame to evaluate without saving it?

100databricks
New Contributor III

The problem at hand requires me to take a set of actions on a very large data frame, df_1. This set of actions results in a second data frame, df_2, and from this second data frame I have multiple downstream tasks: task_1, task_2, ... By default, these tasks will repeat the computation between df_1 and df_2. Is there a way to force evaluation at df_2 so that I don't need to repeat the same costly calculations again and again? I know I could save df_2 to force the evaluation, but I wonder if there is another way that avoids writing and reading. Thanks!
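A minimal sketch of the behavior described, with hypothetical transformations, to make the recomputation concrete:

# df_2 is built from df_1 through some expensive transformations
df_2 = df_1.filter(...).join(...)

# Because Spark evaluates lazily and keeps no intermediate results by
# default, each action re-runs the whole lineage from df_1
task_1_result = df_2.count()             # computes df_1 -> df_2
task_2_result = df_2.distinct().count()  # recomputes df_1 -> df_2 again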

1 ACCEPTED SOLUTION

filipniziol
Contributor III

Hi @100databricks,

Yes, you can call df_2.cache() or df_2.persist().

(df_2.cache() is shorthand for df_2.persist() with the default storage level; for DataFrames that default is MEMORY_AND_DISK, while for RDDs it is MEMORY_ONLY.)

Here is the pseudo-code:

# StorageLevel is needed for persist()
from pyspark import StorageLevel

# df_1 is your large initial DataFrame
df_1 = ...

# Perform expensive transformations to get df_2
df_2 = df_1.filter(...).join(...).groupBy(...).agg(...)

# Mark df_2 for caching. persist() is lazy, so nothing is computed yet.
df_2.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes df_2 and fills the cache; later actions
# reuse the cached data instead of recomputing the lineage from df_1.
result_task_1 = df_2.select(...).where(...).collect()
df_2.groupBy(...).sum().show()  # show() prints, so no need to assign its result
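As a follow-up: once the downstream tasks have finished, it is good practice to release the cached blocks so the executors' memory and disk can be reused. A minimal sketch, assuming df_2 was persisted as above:

# Confirm the storage level that df_2 is marked with
print(df_2.storageLevel)

# Release the cached blocks once task_1, task_2, ... are done
df_2.unpersist()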


2 REPLIES

100databricks
New Contributor III

Thank you! This is exactly what I need.
