Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

dbutils.fs.mv taking too long with delta table

anmol_deep
New Contributor III

I have a folder that contains multiple Delta tables and some Parquet tables. I want to move that folder to another path, but when I use dbutils.fs.mv(), it takes an absurd amount of time.
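Roughly the call in question, with hypothetical placeholder paths:

# Hypothetical paths; the third argument (recurse=True) moves the folder's contents too
dbutils.fs.mv("/mnt/source/tables_folder", "/mnt/target/tables_folder", True)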

1 ACCEPTED SOLUTION


Hi @Kaniz Fatma! Please convey my request to the development team: make the dbutils.fs commands faster. Implement multithreading/multiprocessing, as it seems the dbutils.fs commands are single-threaded. If this is not the best place to share feedback, let me know where I can do so.


8 REPLIES

Hubert-Dudek
Esteemed Contributor III

dbutils is single-threaded, so that can happen. You can use COPY INTO or INSERT INTO, especially when both locations are registered in the metastore. If it is an exact 1:1 copy, I would recommend the Azure Data Factory copy activity, as it has high throughput and is cheap. From ADF you can trigger a Databricks notebook as well, and from Databricks you can trigger an ADF pipeline using Logic Apps.
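A minimal sketch of the INSERT INTO route, with hypothetical table names, assuming both tables are already registered in the metastore:

# Hypothetical table names; copies the data through Spark instead of dbutils
spark.sql("INSERT INTO target_db.my_table SELECT * FROM source_db.my_table")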

anmol_deep
New Contributor III

Thanks @Hubert Dudek!

Actually, I want to delete the folder, but when I try to do that, I get this error: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.

That's why I switched to mv.

Since it's single-threaded, would you advise using Python's threading library and deleting each Delta table in its own thread? Would that be a good idea, or could it have unintended consequences?
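Something like this sketch, with a hypothetical parent folder:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical parent folder; each entry is one table directory to remove
paths = [f.path for f in dbutils.fs.ls("/mnt/data/tables_folder")]

def remove(path):
    # recurse=True removes the directory and everything under it
    dbutils.fs.rm(path, True)

# Run the single-threaded dbutils calls concurrently
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(remove, paths))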

Hubert-Dudek
Esteemed Contributor III

If you want to delete recursively, you need to pass True as the second argument. Deleting is faster than moving, so I don't think it makes sense to orchestrate a whole pipeline for it.

dbutils.fs.rm('/path', True)

I have tried that. It doesn't work and throws the error I mentioned above (I did add recurse=True).

Even if I try to delete using %sh rm -rf, the same error occurs. All files get deleted except this folder: _delta_log. If I try to delete it, I get the error I mentioned above.
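To inspect what survives the delete attempt, something like this, with a hypothetical path:

# Hypothetical path; lists whatever files remain under _delta_log
display(dbutils.fs.ls("/mnt/data/tables_folder/my_table/_delta_log"))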

Hubert-Dudek
Esteemed Contributor III

Maybe that Delta table is still registered in the Hive metastore? Can you check the tables and run DROP TABLE?
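For example, with hypothetical database and table names:

# Hypothetical names; list registered tables, then drop the stale entry
spark.sql("SHOW TABLES IN my_db").show()
spark.sql("DROP TABLE IF EXISTS my_db.my_table")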

Hubert-Dudek
Esteemed Contributor III

Please also try restarting the cluster and upgrading the runtime version.

Hi @Kaniz Fatma and @Hubert Dudek! Yes, I tried Hubert's suggestions, but they didn't work for me; I was still getting this error. I also tried running vacuum(0) on the table and then tried to delete it, but even that didn't work. I kept getting the same error again and again.
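For reference, that kind of vacuum call would look roughly like this, with a hypothetical table path (a 0-hour retention also requires disabling Delta's retention safety check):

from delta.tables import DeltaTable

# Hypothetical path; retentionDurationCheck must be disabled for vacuum(0)
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
DeltaTable.forPath(spark, "/mnt/data/tables_folder/my_table").vacuum(0)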

