02-28-2023 05:06 AM
Hi all,
I am using a persist call on a Spark DataFrame inside an application to speed up computations. The DataFrame is used throughout my application, and at the end of the application I try to clear the cache of the whole Spark session by calling clearCache on the Spark session. However, I am unable to clear the cache.
So something along these lines happens:
# Python Code
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # config options omitted for this example
# Just an Example
df = spark.read.csv("example.csv")
df.cache()
# Clearing Cache
spark.catalog.clearCache()
The clearCache command doesn't do anything and the cache is still visible in the Spark UI (Databricks -> Spark UI -> Storage).
The following command also doesn't show any persistent RDDs, while the Storage tab of the UI shows multiple cached RDDs.
# Python Code
from pyspark.sql import SQLContext
spark_context = spark._sc  # underlying SparkContext of the session
sql_context = SQLContext(spark_context)  # SQLContext wrapping that same context
spark._jsc.getPersistentRDDs()  # query persisted RDDs via the Java SparkContext
# Results in:
{}
What is the correct way to clear the cache of the spark session / spark cluster?
Specs: I am on Databricks Runtime 10.4 LTS and, accordingly, I am using databricks-connect==10.4.18.
03-08-2023 04:17 AM
@Maarten van Raaij : Please try the options below and experiment with them:
03-12-2023 09:44 PM
Hi @Maarten van Raaij
Hope everything is going great.
Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.
Cheers!
03-13-2023 02:52 AM
No solution yet:
Hi @Suteja Kanuri ,
Thank you for thinking along and replying!
Unfortunately, I have not found a solution yet.
Besides these points, I am also wondering why the cache shows up in my Spark UI and is used when running calculations on the data, yet I cannot retrieve the persistent RDDs when using the sql_context (see the last code block in the original post).
Are there any other ideas I could try?
Kind regards,
Maarten
03-13-2023 04:38 AM
@Maarten van Raaij :
Example:
from pyspark.sql import SparkSession
# create a Spark session
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
# cache a DataFrame
df = spark.read.csv("data.csv")
df.cache()
# clear the cache
spark.catalog.clearCache()
# unpersist the DataFrame from memory
df.unpersist()
Note that the cache() method on the DataFrame is used to cache the data in memory. The unpersist() method is used to remove the data from memory after it is no longer needed.
It's possible that you are using the wrong Spark context to access the cached RDD. If you cache an RDD using the SparkContext object, you need to use the same object to retrieve the cached RDD later. Similarly, if you cache a DataFrame using the SparkSession object, you need to use the same object to retrieve the cached DataFrame later. If you are using the sql_context object to access the cached RDD, it may not be able to find the cached RDD because it was cached using a different Spark context.
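To illustrate the point, here is a minimal sketch (the "data.csv" file name is just the placeholder from the example above) that caches a DataFrame and then inspects the persisted entries through the same session, so that the lookup goes against the same underlying SparkContext:
# Python Code
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ExampleApp").getOrCreate()
df = spark.read.csv("data.csv")  # placeholder input from the example above
df.cache()
df.count()  # cache() is lazy; running an action materializes the cached blocks
# Query the persisted RDDs through the same underlying context that cached them
print(spark._jsc.getPersistentRDDs())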
03-13-2023 05:01 AM
Hi Suteja,
Thanks for the quick reply.
I have already tried the ```spark.catalog.clearCache()``` method, but it doesn't work; that was actually the reason for posting this question.
In your code example, you call unpersist on the DataFrame after the cache has already been cleared. Just for my information, why would we call unpersist on the DataFrame if we have already cleared the cache of the session (assuming that worked)?
For clarity, ```df.unpersist()``` does work, but it is cumbersome to implement in my application, as the df is created in a local scope and is referred to by other scopes. I want to unpersist the df only at the end of my application, where I no longer have access to the df variable. That is why I simply want to clear the cache of the whole Spark cluster at the end of my application.
On the last part: I am calling spark.catalog.clearCache() on the same Spark session in which I persist my data. The Spark context and SQL context are also derived from that same SparkSession.
03-13-2023 05:15 AM
@Maarten van Raaij :
Reason for calling unpersist() after clearCache() ->
When you call spark.catalog.clearCache(), it clears the cache of all cached tables and DataFrames in Spark. However, it's important to note that the clearCache() method only removes the metadata associated with the cached tables and DataFrames, not the actual cached data itself. The actual cached data remains in memory until it is either evicted due to memory pressure or explicitly unpersisted using the unpersist() method.
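As an illustration, here is a minimal sketch (assuming df is the DataFrame that was cached earlier) combining both calls; passing blocking=True makes unpersist wait until the blocks have actually been removed from the executors:
# Python Code
spark.catalog.clearCache()  # clear the cache entries Spark tracks for tables/DataFrames
df.unpersist(blocking=True)  # synchronously free the cached blocks on the executors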
03-13-2023 05:24 AM
@Maarten van Raaij :
Re-answering your second question on why the UI shows multiple cached RDDs.
Some reasons:
It's possible that the getPersistentRDDs() method is not returning any cached RDDs because the RDDs are cached using Storage Level MEMORY_AND_DISK, which means that they can be evicted from memory and written to disk if memory pressure becomes too high. In this case, getPersistentRDDs() will only show RDDs that are stored entirely in memory (Storage Level MEMORY_ONLY or MEMORY_ONLY_SER) and not RDDs that are stored in memory and/or on disk (Storage Level MEMORY_AND_DISK or MEMORY_AND_DISK_SER).
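For reference, here is a minimal sketch of persisting with an explicit storage level (reusing the "example.csv" file from the original post); the level you choose determines whether blocks may spill to disk:
# Python Code
from pyspark import StorageLevel
df = spark.read.csv("example.csv")  # file name taken from the original post
df.persist(StorageLevel.MEMORY_AND_DISK)  # blocks may be evicted from memory and spilled to disk
df.count()  # trigger an action so the blocks are actually materialized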
To see all the cached RDDs, including those stored on disk, use the code below:
# Python Code
from pyspark import SparkContext
sc = SparkContext.getOrCreate()
rdd_storage_info = sc.getRDDStorageInfo()
for info in rdd_storage_info:
    print(info)  # id, storage level, memory usage, and disk usage of each cached RDD
This will print out information about all the cached RDDs, including their ID, storage level, memory usage, and disk usage.
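If getRDDStorageInfo() is not exposed on the Python SparkContext in your runtime, a possible workaround (just a sketch, assuming the Scala SparkContext is reachable through the py4j gateway; this is not part of the public Python API) is to call it on the underlying JVM context:
# Python Code
for info in spark._jsc.sc().getRDDStorageInfo():
    print(info.toString())  # id, storage level, cached partitions, memory and disk size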
03-13-2023 08:40 AM
That might explain it, since I do use the MEMORY_AND_DISK option.
The .getRDDStorageInfo() method is actually also not supported for me, but I have enough info to continue now. Thanks for the help!
04-01-2023 09:53 PM
@Maarten van Raaij : That's lovely! Could you upvote the answer that helped you the most?