filipniziol
Esteemed Contributor

Hi @100databricks,

Yes, you can call df_2.cache() or df_2.persist() so that df_2 is computed once and then reused.

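(For a DataFrame, df_2.cache() is a shortcut for df_2.persist() with the default storage level, which is MEMORY_AND_DISK, or MEMORY_AND_DISK_DESER in recent PySpark versions. MEMORY_ONLY is the default for RDDs, not for DataFrames.)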

Here is the pseudo-code:

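from pyspark import StorageLevel

# df_1 is your large initial DataFrame
df_1 = ...

# Perform expensive transformations to get df_2
df_2 = df_1.filter(...).join(...).groupBy(...).agg(...)

# Mark df_2 for caching; this is lazy, so the data is stored when the first action runs
df_2.persist(StorageLevel.MEMORY_AND_DISK)

# Now use df_2 in multiple tasks
result_task_1 = df_2.select(...).where(...).collect()
# show() prints the rows and returns None, so do not assign its result
df_2.groupBy(...).sum().show()

# Release the cached data once it is no longer needed
df_2.unpersist()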
