
Efficient caching/persisting

KosmaS
New Contributor III

To cache/persist a DataFrame, an action needs to be triggered. I'm just wondering: will it make any difference if, after persisting some df, I use, for instance, take(5) instead of count()?


Will it be a bit more effective because results from only a few partitions are sent to the driver, instead of calculating counts for all the partitions?

1 ACCEPTED SOLUTION

Accepted Solutions

Rishabh-Pandey
Esteemed Contributor

Yes, take(5) will be more efficient in some ways.


When you cache or persist a DataFrame in Spark, you are instructing Spark to store the DataFrame's intermediate data in memory (or on disk, depending on the storage level). This can significantly speed up subsequent actions on that DataFrame, because Spark doesn't need to recompute the DataFrame from the source data.
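A minimal PySpark sketch of this (assuming an existing SparkSession named spark; the DataFrame here is just a placeholder for your real read + transformations):

from pyspark import StorageLevel

df = spark.range(10_000_000)         # stand-in for your real read + transformations

# persist()/cache() only mark the plan for caching -- nothing is stored yet
df.persist(StorageLevel.DISK_ONLY)   # or df.cache(), which defaults to MEMORY_AND_DISK for DataFrames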

Impact of take(5) and count() on Cached Data


count() with Cached Data:

Behavior: When you call count() on a cached or persisted DataFrame, Spark will leverage the cached data to compute the count. Since the data is already stored in memory (or on disk), the count operation is generally faster than it would be if the data were not cached. However, it still requires a scan of the entire DataFrame and aggregation of row counts from all partitions.
Performance: Although faster due to caching, count() will still involve scanning all partitions and aggregating results, which might be relatively expensive for very large DataFrames.


take(5) with Cached Data:

Behavior: When you call take(5) on a cached or persisted DataFrame, Spark uses the cached data to retrieve the first 5 rows. Since it doesn't need to scan the entire DataFrame, it can be much faster. It only scans enough partitions to retrieve 5 rows and then stops.


Performance: Because it processes only a subset of the data (the first 5 rows) and stops once it has retrieved them, take(5) is typically very efficient. Even with cached data, it benefits from the reduced amount of data processing.
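Putting the two trigger options side by side, a sketch continuing the placeholder df from the snippet above (actual behaviour and timings depend on data size and cluster):

# Option A: materialize the cache with count() -- scans all partitions and
# aggregates the row counts, so the entire DataFrame ends up cached.
df.count()

# Option B: materialize with take(5) -- computes only as many partitions as it
# needs to return 5 rows, so it returns sooner, but only those partitions get cached.
df.take(5)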

Rishabh Pandey


3 REPLIES

p4pratikjain
Contributor

If you want to cache the DataFrame without having to perform an action, you can use Spark SQL. The Python API is always evaluated lazily for both cache() and persist(), whereas Spark SQL lets you specify whether the cache should be evaluated lazily or eagerly.
Reference - 
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html
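For illustration, a small sketch of that eager/lazy choice through spark.sql() (assuming an existing SparkSession named spark; df and the view name are just placeholders):

df.createOrReplaceTempView("df_view")   # register a view so Spark SQL can cache it

# Eager by default: the CACHE TABLE statement itself scans and caches the data,
# so no separate action is needed afterwards.
spark.sql("CACHE TABLE df_view")

# Lazy variant: behaves like the Python API and materializes on first use.
# spark.sql("CACHE LAZY TABLE df_view")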

Pratik Jain


KosmaS
New Contributor III

Thanks Rishabh-Pandey for your reply!

So, to clarify with a code example, let's say this is PROD code:

 

 

df1 = spark.read..... + transformations

df1.persist(StorageLevel.DISK_ONLY)   # or df1.cache()


df2 = df1.transformation...
df3 = df1.transformation...

df4 = df2.union(df3)

df4.write...

 

I noticed that some people are stating that:

1. If there's one action in the whole ETL, then it's enough to persist/cache the data, like in the example above.
2. Others state that right after you persist/cache, you need to trigger some action (like in my previous question: take(5), for instance).

So who's right? Will Spark apply the persist/cache even if the only action comes at the end of the chain of transformations?

I understand that you need to call some action right after persisting/caching if you want to cache/persist some other DataFrame along the way in a long chain of transformations and unpersist the previous one, like here:

 

df1 = spark.read..... + transformations

df1.persist(StorageLevel.DISK_ONLY)   # or df1.cache()
df1.take(1) <<<<==== here

df2 = df1.transformation...
df3 = df1.transformation...

df4 = df2.union(df3)
df4.persist(StorageLevel.DISK_ONLY)
df1.unpersist()

df4.write...

 


But if that's not the case, then what is the purpose? Does it make any sense?
