02-28-2023 05:06 AM
Hi all,
I am calling persist on a Spark DataFrame inside an application to speed up computations. The DataFrame is used throughout the application, and at the end I try to clear the cache of the whole Spark session by calling clearCache on the session's catalog. However, the cache is not cleared.
So something along these lines happens:
# Python Code
from pyspark.sql import SparkSession
spark = SparkSession.builder.config().getOrCreate()
# Just an Example
df = spark.read.csv("example.csv")
df.cache()
# Clearing Cache
spark.catalog.clearCache()
The clearCache command doesn't appear to do anything, and the cache is still visible in the Spark UI (Databricks -> Spark UI -> Storage).
The following command also doesn't show any persistent RDDs, while in reality the Storage tab in the UI shows multiple cached RDDs.
# Python Code
from pyspark.sql import SQLContext
spark_context = spark._sc
sql_context = SQLContext(spark_context)
spark._jsc.getPersistentRDDs()
# Results in:
{}
What is the correct way to clear the cache of the Spark session / Spark cluster?
Specs: I am on Databricks Runtime 10.4 LTS and, correspondingly, I am using databricks-connect==10.4.18.
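One workaround that may be worth trying is to keep a handle on every DataFrame that gets cached and call unpersist on each one explicitly, instead of relying on a session-wide clearCache. Below is a minimal sketch of that idea. The CacheRegistry class and its method names are hypothetical helpers made up for illustration; only DataFrame.cache() and DataFrame.unpersist(blocking=...) are PySpark calls. Whether explicit unpersist actually behaves differently from clearCache under databricks-connect is an assumption that has not been verified here.

```python
class CacheRegistry:
    """Hypothetical helper: remembers cached DataFrames so they can be released one by one."""

    def __init__(self):
        self._tracked = []

    def cache(self, df):
        # DataFrame.cache() marks the DataFrame for caching and returns it.
        cached = df.cache()
        self._tracked.append(cached)
        return cached

    def release_all(self, blocking=True):
        # Call unpersist on every tracked DataFrame and report how many were released.
        released = 0
        while self._tracked:
            df = self._tracked.pop()
            # blocking=True asks Spark to wait until the cached blocks are removed.
            df.unpersist(blocking=blocking)
            released += 1
        return released
```

In the application this would look like registry = CacheRegistry(), then df = registry.cache(spark.read.csv("example.csv")) where the DataFrame is first cached, and registry.release_all() at the end of the run, followed by a check of the Storage tab in the Spark UI.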
Labels: Pyspark Session