03-13-2023 05:24 AM
@Maarten van Raaij :
Re-answering your 2nd question on why the UI shows multiple cached RDDs.
One possible reason:
getPersistentRDDs() and the Storage tab in the UI do not have to agree. getPersistentRDDs() lists the RDDs that were marked as persistent on the SparkContext, regardless of storage level, and being listed there does not necessarily mean the data was actually cached. The UI reports what is actually stored. If the RDDs are cached using Storage Level MEMORY_AND_DISK, partitions can also be evicted from memory and written to disk when memory pressure becomes too high, so an RDD can show up as held partly in memory and partly on disk.
To see all the cached RDDs, including those stored on disk, use the code below:
# Python Code
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
# PySpark's SparkContext does not expose getRDDStorageInfo() itself,
# so call it on the underlying Scala SparkContext through the JVM gateway.
rdd_storage_info = sc._jsc.sc().getRDDStorageInfo()
for info in rdd_storage_info:
    print(info.toString())

This will print out information about all the cached RDDs, including their ID, storage level, memory usage, and disk usage.
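If you would rather have the storage info as plain Python values than as printed strings, here is a minimal sketch. The helper names (summarize_rdd_info, cached_rdd_summary, spilled_to_disk) are hypothetical and only for illustration. It assumes the accessors of Scala's RDDInfo class (id, name, storageLevel, numCachedPartitions, numPartitions, memSize, diskSize) are reachable through py4j, and it relies on the private _jsc attribute, which may change between Spark versions.

```python
def summarize_rdd_info(infos):
    """Turn RDDInfo-like objects into plain dicts (hypothetical helper)."""
    rows = []
    for info in infos:
        rows.append({
            "id": info.id(),
            "name": info.name(),
            # StorageLevel.description() gives a readable label
            "storage_level": info.storageLevel().description(),
            "cached_partitions": info.numCachedPartitions(),
            "total_partitions": info.numPartitions(),
            "mem_bytes": info.memSize(),
            "disk_bytes": info.diskSize(),
        })
    return rows


def cached_rdd_summary(sc):
    """Summarize cached RDDs for a PySpark SparkContext.

    Assumption: sc._jsc.sc() returns the underlying Scala SparkContext.
    _jsc is a private attribute, so treat this as a sketch.
    """
    return summarize_rdd_info(sc._jsc.sc().getRDDStorageInfo())


def spilled_to_disk(rows):
    """Keep only the entries that currently hold data on disk."""
    return [row for row in rows if row["disk_bytes"] > 0]
```

With a live context you could call cached_rdd_summary(sc) and pass the result to spilled_to_disk() to see which cached RDDs currently have partitions on disk.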