Hi @100databricks,
Yes, you can run df_2.cache() or df_2.persist().
(For DataFrames, df_2.cache() is a shortcut for df_2.persist(StorageLevel.MEMORY_AND_DISK); the MEMORY_ONLY default applies only to RDDs.)
Here is the pseudo-code:
from pyspark import StorageLevel

# df_1 is your large initial DataFrame
df_1 = ...
# Perform expensive transformations to get df_2
df_2 = df_1.filter(...).join(...).groupBy(...).agg(...)
# Cache df_2 (lazy: the data is stored the first time an action runs on it)
df_2.persist(StorageLevel.MEMORY_AND_DISK)
# Now reuse df_2 in multiple actions; only the first one recomputes it
result_task_1 = df_2.select(...).where(...).collect()
df_2.groupBy(...).sum().show()  # show() prints to stdout and returns None

# Release the cached data once you are done with df_2
df_2.unpersist()