
Efficient caching/persisting

KosmaS
New Contributor III

To cache/persist a DataFrame, an action needs to be triggered. I'm just wondering: will it make any difference if, after persisting some df, I use, for instance, take(5) instead of count()?


Will it be a bit more effective because results from only a few partitions are sent to the driver, instead of calculating counts for all the partitions?

1 ACCEPTED SOLUTION

Accepted Solutions

Rishabh-Pandey
Esteemed Contributor

Yes, take(5) will be more efficient in some ways.


When you cache or persist a DataFrame in Spark, you are instructing Spark to store the DataFrame's intermediate data in memory (or on disk, depending on the storage level). This can significantly speed up subsequent actions on that DataFrame, because Spark doesn't need to recompute the DataFrame from the source data.
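A minimal PySpark sketch of this (assuming an existing SparkSession named spark; the DataFrame here is just a placeholder for your real read + transformations):

from pyspark import StorageLevel

df = spark.range(10_000_000)         # stand-in for your real read + transformations

# persist()/cache() only mark the plan for caching -- nothing is stored yet
df.persist(StorageLevel.DISK_ONLY)   # or df.cache(), which defaults to MEMORY_AND_DISK for DataFrames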

Impact of take(5) and count() on Cached Data


count() with Cached Data:

Behavior: When you call count() on a cached or persisted DataFrame, Spark will leverage the cached data to compute the count. Since the data is already stored in memory (or on disk), the count operation is generally faster than it would be if the data were not cached. However, it still requires a scan of the entire DataFrame and aggregation of row counts from all partitions.
Performance: Although faster due to caching, count() will still involve scanning all partitions and aggregating results, which might be relatively expensive for very large DataFrames.


take(5) with Cached Data:

Behavior: When you call take(5) on a cached or persisted DataFrame, Spark uses the cached data to retrieve the first 5 rows. Since it doesn't need to scan the entire DataFrame, it can be much faster. It only scans enough partitions to retrieve 5 rows and then stops.


Performance: Because it processes only a subset of the data (the first 5 rows) and stops once it has retrieved them, take(5) is typically very efficient. Even with cached data, it benefits from the reduced amount of data processing.
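Putting the two trigger options side by side, a sketch continuing the placeholder df from the snippet above (actual behaviour and timings depend on data size and cluster):

# Option A: materialize the cache with count() -- scans all partitions and
# aggregates the row counts, so the entire DataFrame ends up cached.
df.count()

# Option B: materialize with take(5) -- computes only as many partitions as it
# needs to return 5 rows, so it returns sooner, but only those partitions get cached.
df.take(5)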

Rishabh Pandey


3 REPLIES

p4pratikjain
Contributor

If you want to cache the DataFrame without having to perform an action, you can use Spark SQL. The Python API is always evaluated lazily for both cache() and persist(), whereas Spark SQL lets you specify whether the cache should be evaluated lazily or eagerly.
Reference - 
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html
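For illustration, a small sketch of that eager/lazy choice through spark.sql() (assuming an existing SparkSession named spark; df and the view name are just placeholders):

df.createOrReplaceTempView("df_view")   # register a view so Spark SQL can cache it

# Eager by default: the CACHE TABLE statement itself scans and caches the data,
# so no separate action is needed afterwards.
spark.sql("CACHE TABLE df_view")

# Lazy variant: behaves like the Python API and materializes on first use.
# spark.sql("CACHE LAZY TABLE df_view")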

Pratik Jain


KosmaS
New Contributor III

Thanks Rishabh-Pandey for your reply!

So, to clarify with a code example, let's say this is PROD code:

 

 

df1 = spark.read..... + transformations

df1.persist(StorageLevel.DISK_ONLY)   # or df1.cache()


df2 = df1.transformation...
df3 = df1.transformation...

df4 = df2.union(df3)

df4.write...

 

I noticed that some people are stating that:

1. If there's one action in the whole ETL, then it's enough to persist/cache the data, like in the example above.
2. Others state that right after you persist/cache, you need to trigger some action (like in my previous question: take(5), for instance).

So who's right? Will Spark apply the persist/cache even if the only action comes at the end of the chain of transformations?

I understand that you need to call some action right after persisting/caching if you want to cache/persist some other DataFrame along the way in a long chain of transformations and unpersist the previous one, like here:

 

df1 = spark.read..... + transformations

df1.persist(StorageLevel.DISK_ONLY)   # or df1.cache()
df1.take(1) <<<<==== here

df2 = df1.transformation...
df3 = df1.transformation...

df4 = df2.union(df3)
df4.persist(StorageLevel.DISK_ONLY)
df1.unpersist()

df4.write...

 


But if that's not the case, then what is the purpose? Does it make any sense?
