Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How can I force a data frame to evaluate without saving it?

100databricks
New Contributor III

The problem at hand requires me to take a set of actions on a very large data frame, df_1. This set of actions results in a second data frame, df_2, and from this second data frame I have multiple downstream tasks: task_1, task_2, ... By default, these tasks will repeat the computation between df_1 and df_2. Is there a way to force evaluation at df_2 so that I don't need to repeat the same costly calculations again and again? I know I could save df_2 to force the evaluation, but I wonder if there is another way that avoids writing and reading. Thanks!
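A minimal sketch of the behavior described, with hypothetical transformations, to make the recomputation concrete:

# df_2 is built from df_1 through some expensive transformations
df_2 = df_1.filter(...).join(...)

# Because Spark evaluates lazily and keeps no intermediate results by
# default, each action re-runs the whole lineage from df_1
task_1_result = df_2.count()             # computes df_1 -> df_2
task_2_result = df_2.distinct().count()  # recomputes df_1 -> df_2 again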

1 ACCEPTED SOLUTION

filipniziol
Contributor III

Hi @100databricks,

Yes, you can call df_2.cache() or df_2.persist().

(df_2.cache() is shorthand for df_2.persist() with the default storage level; for DataFrames that default is MEMORY_AND_DISK, while for RDDs it is MEMORY_ONLY.)

Here is the pseudo-code:

# StorageLevel is needed for persist()
from pyspark import StorageLevel

# df_1 is your large initial DataFrame
df_1 = ...

# Perform expensive transformations to get df_2
df_2 = df_1.filter(...).join(...).groupBy(...).agg(...)

# Mark df_2 for caching. persist() is lazy, so nothing is computed yet.
df_2.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materializes df_2 and fills the cache; later actions
# reuse the cached data instead of recomputing the lineage from df_1.
result_task_1 = df_2.select(...).where(...).collect()
df_2.groupBy(...).sum().show()  # show() prints, so no need to assign its result
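As a follow-up: once the downstream tasks have finished, it is good practice to release the cached blocks so the executors' memory and disk can be reused. A minimal sketch, assuming df_2 was persisted as above:

# Confirm the storage level that df_2 is marked with
print(df_2.storageLevel)

# Release the cached blocks once task_1, task_2, ... are done
df_2.unpersist()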


2 REPLIES

100databricks
New Contributor III

Thank you! This is exactly what I need.
