Databricks

GC-James · ‎03-04-2022

Why does copying a 9 GB file from a container to /dbfs lose me 50 GB of memory? (It doesn't come back until I restart the cluster.)

rajeev_thakur_c · ‎03-16-2022

@James Smith You can reach out to Michael.Dibble@guycarp.com, or please follow this doc:

https://docs.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-reque...

Hubert-Dudek · ‎03-04-2022

I guess that when the file is read from one directory it is copied to RAM and then saved from there to the other location (there are also concurrent reads, writes, etc., which is probably why it consumes more space). As the cluster still has plenty of free memory, it wasn't cleaned up automatically. You can try to force a cleanup with spark.catalog.clearCache().

GC-James · ‎03-04-2022

Thanks for the ideas/suggestion. Unfortunately, calling spark.catalog.clearCache() does not return the memory or clean it up.
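For anyone wanting to check whether a call like this actually returns memory, here is a minimal sketch that compares memory figures before and after. It assumes the driver is a Linux machine exposing /proc/meminfo; the helper names (parse_meminfo, read_meminfo, memory_delta_gib) are made up for illustration and are not part of any Databricks API. It only looks at the node the notebook runs on, and it reports "Cached" (file cache held by the operating system) separately from "MemFree", so you can see which figure moved.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value}.

    Sizes in /proc/meminfo are reported in kB; fields without a unit
    (e.g. HugePages_Total) are plain counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Field: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def read_meminfo(path="/proc/meminfo"):
    """Read the current memory figures of the local machine (Linux only)."""
    with open(path) as f:
        return parse_meminfo(f.read())


def memory_delta_gib(before, after, fields=("MemFree", "MemAvailable", "Cached")):
    """Return after-minus-before per field, converted from kB to GiB."""
    return {
        name: (after[name] - before[name]) / (1024 ** 2)
        for name in fields
        if name in before and name in after
    }


# Hypothetical use in a notebook cell (spark is the notebook's session):
#   before = read_meminfo()
#   spark.catalog.clearCache()
#   after = read_meminfo()
#   print(memory_delta_gib(before, after))
```

A negative "MemFree" delta together with a similar positive "Cached" delta would mean the memory moved into the OS file cache rather than being held by a process; the sketch only reports the numbers and does not say why they changed.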

GC-James · ‎03-04-2022

Particularly annoying, as it means that memory is missing from the baseline available to the rest of my code.

Kaniz · ‎03-11-2022

Hi @James Smith, this link might help you with this issue.

GC-James · ‎03-11-2022

Hi Fatma. The article says:

"The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports reading Parquet files in .... ...... It does not support other storage formats such as CSV, JSON, and ORC."

I am copying tif files from an Azure Data Lake Storage Gen2 container to /dbfs/ using the dbutils.fs.cp command. So I don't think the article is relevant, is it?

Kaniz · ‎03-16-2022

Acknowledged.

rajeev_thakur_c · ‎03-16-2022

@James Smith

The current implementation of dbutils.fs is single-threaded, meaning that regardless of whether it’s executed on the driver or inside a Spark job, it will perform recursive operations in a single-threaded loop.

The current implementation performs the initial listing on the driver and subsequently launches a Spark job to perform the per-file operations.

So memory will definitely be in use during the copy, but the unreferenced objects should be cleaned up afterwards. Otherwise the heap would keep piling up (a memory leak).

GC-James · ‎03-16-2022

Hi @Rajeev Kumar. I understand why memory is used during the transfer of the file, but should the memory not be returned after the file has been moved? I do not understand why this does not happen.

rajeev_thakur_c · ‎03-16-2022

Did a GC cycle happen after the cp command? Memory is reclaimed as needed to make room for newly allocated objects. We do not explicitly clean up any memory; that is taken care of by the JVM.

And yes, ideally it should: unreferenced objects are cleared during GC.

If this is not happening, then there might be a memory leak.

We can run an init script to collect heap dumps of the driver and executors; these will show which objects occupy most of the memory, and we can then act accordingly.
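Before going as far as a heap dump, one quick check is to ask the driver JVM for a GC cycle from the notebook and compare heap usage before and after. The following is a rough sketch only: request_driver_gc and driver_heap_used_bytes are hypothetical helper names, spark.sparkContext._jvm is a private PySpark attribute (the py4j gateway) that may not be reachable on every cluster configuration, and System.gc() is only a request that the JVM is free to ignore. It covers the driver JVM heap only, not the executors and not memory held outside the JVM.

```python
def driver_heap_used_bytes(jvm):
    """Heap currently in use by the JVM: total allocated heap minus free heap."""
    runtime = jvm.java.lang.Runtime.getRuntime()
    return runtime.totalMemory() - runtime.freeMemory()


def request_driver_gc(spark):
    """Ask the driver JVM to run a GC cycle.

    Returns (used_before, used_after) in bytes so the effect can be compared.
    """
    # Private attribute: the py4j view of the driver JVM (an assumption that
    # it is accessible in your environment).
    jvm = spark.sparkContext._jvm
    used_before = driver_heap_used_bytes(jvm)
    jvm.java.lang.System.gc()  # a hint; the JVM decides whether to collect
    used_after = driver_heap_used_bytes(jvm)
    return used_before, used_after


# Hypothetical use in a notebook cell, after the dbutils.fs.cp call:
#   before, after = request_driver_gc(spark)
#   print(f"driver heap used: {before / 2**30:.2f} GiB -> {after / 2**30:.2f} GiB")
```

If the driver heap barely changes while the machine-level memory is still missing, the memory is probably not held as unreferenced objects on the driver heap, and a heap dump of the driver alone is unlikely to show it.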