Unable to clear cache using a pyspark session

maartenvr

Hi all,

I am calling persist on a Spark DataFrame inside an application to speed up computations. The DataFrame is used throughout the application, and at the end I try to clear the cache of the whole Spark session by calling clearCache on the session's catalog. However, the cache is not cleared.

So something along these lines happens:

# Python Code
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
 
# Just an Example 
df = spark.read.csv("example.csv")
df.cache()
 
# Clearing Cache
spark.catalog.clearCache()

The clearCache call doesn't appear to do anything: the cache is still visible in the Spark UI (Databricks -> Spark UI -> Storage).

The following command also doesn't show any persistent RDDs, even though the Storage tab in the UI shows multiple cached RDDs.

# Python Code
# Ask the underlying Java SparkContext for its persistent RDDs
spark._jsc.getPersistentRDDs()
 
# Results in:
{}
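
For comparison, here is a minimal sketch of checking and releasing the cache through the DataFrame handle itself instead of the JVM RDD registry. The helper names `has_storage_level` and `release_cached` are hypothetical. The sketch assumes the application keeps references to the DataFrames it persisted, and it is an untested assumption that `storageLevel` and `unpersist` behave the same through databricks-connect as on the cluster.

```python
def has_storage_level(df):
    """Return True if Spark reports a storage level assigned to df.

    DataFrame.storageLevel asks the JVM for the assigned level; for a
    DataFrame that is not persisted, all of these flags are False.
    """
    level = df.storageLevel
    return bool(level.useMemory or level.useDisk or level.useOffHeap)


def release_cached(spark, dataframes):
    """Unpersist each DataFrame, then clear the session-wide cache.

    Returns the DataFrames that still report a storage level afterwards.
    """
    for df in dataframes:
        # blocking=True waits until the cached blocks have been removed
        df.unpersist(blocking=True)
    # Removes all cached tables and DataFrames registered in this session
    spark.catalog.clearCache()
    return [df for df in dataframes if has_storage_level(df)]
```

At the end of the application this would be called as `release_cached(spark, [df])`. A non-empty result would point at a DataFrame that Spark still considers persisted.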

What is the correct way to clear the cache of the spark session / spark cluster?

Specs: I am on Databricks Runtime 10.4 LTS and, correspondingly, I am using databricks-connect==10.4.18.