12-02-2024 03:36 PM
The problem I'm working on requires a set of actions on a very large DataFrame, df_1. These produce a second DataFrame, df_2, which feeds multiple downstream tasks: task_1, task_2, ... By default, each task repeats the computation from df_1 to df_2. Is there a way to force evaluation at df_2 so that I don't have to repeat the same costly calculations again and again? I know I could save df_2 to force the evaluation, but I wonder if there is another way that avoids writing and reading. Thanks!
Accepted Solutions
12-02-2024 11:55 PM
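Hi @100databricks,
Yes, you can run df_2.cache() or df_2.persist().
(For DataFrames, df_2.cache() is a shortcut for df_2.persist() with the default storage level, which is a memory-and-disk level such as MEMORY_AND_DISK, not MEMORY_ONLY; MEMORY_ONLY is the default for RDD.cache().)
Both calls are lazy: nothing is stored until the first action runs on df_2, so run a cheap action such as df_2.count() if you want to force the evaluation right away.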
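Here is the pseudo-code:
from pyspark import StorageLevel
# df_1 is your large initial DataFrame
df_1 = ...
# Perform expensive transformations to get df_2
df_2 = df_1.filter(...).join(...).groupBy(...).agg(...)
# Mark df_2 for caching (lazy; nothing is computed yet)
df_2.persist(StorageLevel.MEMORY_AND_DISK)
# Run an action to force the evaluation and fill the cache
df_2.count()
# Now use df_2 in multiple tasks; they read from the cache
result_task_1 = df_2.select(...).where(...).collect()
# show() prints the rows and returns None, so don't assign its result
df_2.groupBy(...).sum().show()
# Release the cached data once all tasks are done
df_2.unpersist()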
12-03-2024 10:25 AM
Thank you! This is exactly what I need.

