Data Engineering

Lost memory when using dbutils

GC-James
Contributor II

Why does copying a 9GB file from a container to /dbfs cost me 50GB of memory (which doesn't come back until I restart the cluster)?


1 ACCEPTED SOLUTION


@James Smith You can reach out to Michael.Dibble@guycarp.com, or please follow this doc:

https://docs.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-reque...


17 REPLIES

Hubert-Dudek
Esteemed Contributor III

I guess that when a file is read from one location, it is copied into RAM and then written from there to the destination (there are also concurrent reads and writes, which is probably why it consumes more space). As the cluster still has plenty of free memory, it wasn't cleaned up automatically. You can try to force a cleanup with spark.catalog.clearCache().

Thanks for the suggestion. Unfortunately, calling spark.catalog.clearCache() does not return the memory or clean it up.
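To put numbers on what is and is not returned, one option is to snapshot the operating system's memory counters on the driver before and after the copy. The following is a minimal sketch that assumes a Linux driver node exposing /proc/meminfo. The helper names (parse_meminfo, read_meminfo, meminfo_delta) are made up for illustration and are not part of dbutils or Spark, and the paths in the usage comment are placeholders.

```python
# Sketch: snapshot Linux memory counters on the driver before and after a copy.
# Assumes a Linux node; all helper names here are illustrative only.


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


def meminfo_delta(before, after, fields=("MemFree", "MemAvailable", "Cached")):
    """Return after - before for the chosen fields present in both snapshots."""
    return {f: after[f] - before[f] for f in fields if f in before and f in after}


# In a notebook the calls might look like this (paths are placeholders):
# before = read_meminfo()
# dbutils.fs.cp("<source path>", "<destination path>")
# after = read_meminfo()
# print(meminfo_delta(before, after))
```

If most of the drop in MemFree reappears under Cached while MemAvailable barely moves, the memory is being held by the OS file cache rather than by a process. That is something to check with the numbers, not a diagnosis made in this thread.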

GC-James
Contributor II

This is particularly annoying because the rest of my code then runs with that memory missing from the baseline.


Kaniz
Community Manager

Hi @James Smith, this link might help you with this issue.

Hi Fatma. The article says:

"The Delta cache works for all Parquet files and is not limited to Delta Lake format files. The Delta cache supports reading Parquet files in .... ...... It does not support other storage formats such as CSV, JSON, and ORC."

I am copying TIF files from an Azure Data Lake Storage Gen2 container to /dbfs/ using the dbutils.fs.cp command, so I don't think the article is relevant, is it?

Kaniz
Community Manager

Acknowledged.

rajeev_thakur_c
New Contributor III

@James Smith​ 

The current implementation of dbutils.fs is single-threaded, meaning that regardless of whether it’s executed on the driver or inside a Spark job, it will perform recursive operations in a single-threaded loop.

The current implementation performs the initial listing on the driver and subsequently launches a Spark job to perform the per-file operations.

So memory will definitely be in use during the copy, but unreferenced objects should be cleaned up afterwards. Otherwise the heap would keep growing (a memory leak).

Hi @Rajeev Kumar. I understand why memory is used during the transfer of the file. But should the memory not be returned after the file has been moved? I do not understand why this does not happen.

rajeev_thakur_c
New Contributor III

Did a GC cycle happen after the cp command? Memory is reclaimed when it is needed for newly allocated objects. We do not explicitly clean up any memory; that is taken care of by the JVM.

And yes, ideally it should be returned: unreferenced objects are cleared during GC.

If this is not happening, then there might be a memory leak.

We can run an init script to collect a heap dump of the driver and executors, see which objects occupy most of the memory, and then act accordingly.

"Did the GC cycle happen after the cp command?"

How would I know if this happened or not?
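One rough way to check from a Python notebook is to read the driver JVM's heap figures, request a collection, and read them again. The sketch below assumes `jvm` is the py4j view that PySpark exposes as spark.sparkContext._jvm (an internal attribute, so this is an assumption about the environment). The function names are hypothetical. It only covers the driver's JVM heap; memory held off-heap or by the operating system will not show up here.

```python
# Sketch: inspect the driver JVM heap from a Python notebook.
# `jvm` is assumed to be the py4j view exposed as spark.sparkContext._jvm;
# heap_usage_mb and heap_freed_by_gc_mb are illustrative names only.

MB = 1024 * 1024


def heap_usage_mb(jvm):
    """Return used, committed and maximum heap of the JVM in MB."""
    runtime = jvm.java.lang.Runtime.getRuntime()
    total = runtime.totalMemory()  # heap currently committed by the JVM
    free = runtime.freeMemory()    # unused part of the committed heap
    return {
        "used": (total - free) / MB,
        "committed": total / MB,
        "max": runtime.maxMemory() / MB,
    }


def heap_freed_by_gc_mb(jvm):
    """Request a collection and return how far used heap dropped, in MB."""
    before = heap_usage_mb(jvm)["used"]
    jvm.java.lang.System.gc()      # a request only; the JVM may ignore it
    after = heap_usage_mb(jvm)["used"]
    return before - after


# In a notebook (assuming `spark` is the active session):
# print(heap_usage_mb(spark.sparkContext._jvm))
# print(heap_freed_by_gc_mb(spark.sparkContext._jvm))
```

If used heap is small compared with the missing 50GB both before and after the requested collection, the JVM heap is probably not where the memory is being held, and a heap dump would be unlikely to explain it.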

rajeev_thakur_c
New Contributor III

@James Smith​ please file a case if you can.

Can you tell me how to do that please? I have not done it before.

You can file it with Azure, and they will reach out to us.
