Re: Understanding Used Memory in Databricks Cluste...

loic · ‎02-04-2025

I have exactly the same kind of problem.

I really do not understand why my driver runs out of memory even though I do not cache anything in Spark.

Since I don't cache anything, I would expect objects that are no longer used to be garbage collected.

Even with a simple piece of Scala code that reads a JSON file on DBFS, if I execute it several times in a row (without restarting the cluster), the notebook execution eventually crashes with the message:

"The spark driver has stopped unexpectedly and is restarting. Your notebook will be automatically reattached."

I scheduled this notebook, and after around 30 executions on a DS3v2 (14 GB RAM, 4 cores), it crashes.

This article deals with the same kind of memory usage issue when reading JSON files:

https://blog.devgenius.io/debugging-a-memory-leak-in-spark-application-22140630877d

In the comment section, the author of the page says:

"Earlier for each file spark.read.json was called, now spark.read.json is called only one at the root folder. This reduces the number of stages."
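For readers following along, here is a minimal sketch of the two patterns that comment seems to describe. The helper names `readPerFile` and `readRoot` and any paths passed to them are hypothetical, not taken from the article. Without an explicit schema, each `spark.read.json` call scans its input to infer one, so the per-file variant triggers at least one extra job per file on top of the final query. Whether that alone explains the driver memory growth is an assumption, not something the article confirms.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Pattern 1 (hypothetical helper): one spark.read.json call per file,
// then union the results. Each call runs its own schema inference.
// Note: `files` must be non-empty, otherwise reduce throws.
def readPerFile(spark: SparkSession, files: Seq[String]): DataFrame =
  files.map(f => spark.read.json(f)).reduce(_ unionByName _)

// Pattern 2 (hypothetical helper): a single spark.read.json call on the
// root folder, so schema inference runs once over all files.
def readRoot(spark: SparkSession, root: String): DataFrame =
  spark.read.json(root)
```

In a Databricks notebook, where `spark` is already defined, this would be called as `readRoot(spark, "dbfs:/some/folder")`, with the path replaced by a real one.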

But I don't understand why the number of stages would have an impact on memory.

I have the feeling we can run into the same kind of issue when running SQL queries.

I have been looking for an explanation all over the web, and I found many threads about memory consumption issues.

Some people have more or less the same problem as in this thread, but there was no valid answer to their problem.

Lots of threads are about cache issues but, once again, I don't use the cache. Nevertheless, I did the following test by adding a cell to my notebook:

spark.catalog.clearCache()
System.gc()

But as I was expecting, it didn't change anything.
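One way to put numbers on a test like this is to print the driver's used heap before and after the GC call. This is only a minimal sketch, assuming it runs in a Scala notebook cell (which executes in the driver JVM). `usedHeapMb` is a hypothetical helper name. It reports JVM heap only, so off-heap memory and other processes on the driver node are not counted, and the cluster memory chart may show a different figure.

```scala
import java.lang.management.ManagementFactory

// Hypothetical helper: used heap of the current JVM, in megabytes.
// In a notebook cell the current JVM is the Spark driver.
def usedHeapMb(): Long =
  ManagementFactory.getMemoryMXBean.getHeapMemoryUsage.getUsed / (1024L * 1024L)

val before = usedHeapMb()
System.gc() // a request only; the JVM decides whether and how to collect
val after = usedHeapMb()
println(s"driver heap used: before GC = $before MB, after GC = $after MB")
```

If the "after" value stays high from one run to the next, the memory is still referenced somewhere in the driver, which points away from the cache and toward retained driver-side state.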

Any help is welcome!