dbutils.fs.mv taking too long with delta table

anmol_deep
New Contributor III

I have a folder that contains multiple Delta tables and some Parquet tables. I want to move that folder to another path. When I use dbutils.fs.mv(), it takes an absurd amount of time.
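For reference, the call in question looks roughly like this (a minimal sketch; the source and destination paths below are hypothetical stand-ins, since the real ones weren't shared):

# Hypothetical paths for illustration only
src = "abfss://container@account.dfs.core.windows.net/data/source_folder"
dst = "abfss://container@account.dfs.core.windows.net/data/target_folder"

# recurse=True is required to move a directory and its contents
dbutils.fs.mv(src, dst, recurse=True)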


10 REPLIES

Hubert-Dudek
Esteemed Contributor III

dbutils is single-threaded, so that can happen. You can use COPY INTO or INSERT INTO instead, especially when both locations are registered in the metastore. If it is an exact 1:1 copy, I would recommend the Azure Data Factory Copy activity, as it has high throughput and is cheap. From ADF you can also trigger a Databricks notebook, and from Databricks you can trigger an ADF pipeline using Logic Apps.
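As a rough sketch of this suggestion (the table and path names below are assumptions, not from the thread), both statements run through Spark and therefore parallelize across the cluster, unlike dbutils.fs:

# Hypothetical names; assumes both tables are registered in the metastore
spark.sql("INSERT INTO target_db.events SELECT * FROM source_db.events")

# COPY INTO loads files from a path into an existing Delta table
spark.sql("""
    COPY INTO target_db.events
    FROM 'abfss://container@account.dfs.core.windows.net/data/events'
    FILEFORMAT = PARQUET
""")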

anmol_deep
New Contributor III

Thanks @Hubert Dudek!

Actually I want to delete the folder, but when I try to do that, I get this error: shaded.databricks.org.apache.hadoop.fs.azure.AzureException: hadoop_azure_shaded.com.microsoft.azure.storage.StorageException: This operation is not permitted on a non-empty directory.

That's why I switched to mv.

Since it's single-threaded, would you advise using Python's threading library to delete each Delta table in its own thread? Would that be a good idea, or could it have unintended consequences?
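A minimal sketch of the idea being asked about here, assuming a flat layout of one sub-folder per table and an arbitrary thread count (neither detail is from the thread); whether this is actually safe and faster is exactly the open question:

from concurrent.futures import ThreadPoolExecutor

# Hypothetical root path; list the table folders directly under it
root = "abfss://container@account.dfs.core.windows.net/data/source_folder"
subdirs = [f.path for f in dbutils.fs.ls(root) if f.isDir()]

def remove(path):
    # each delete is independent, so the calls can overlap
    dbutils.fs.rm(path, recurse=True)

# overlap the per-file round trips across a small thread pool
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(remove, subdirs))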

Hubert-Dudek
Esteemed Contributor III

If you want to delete recursively, you need to add True. Deleting is faster, so I don't think it makes sense to orchestrate a whole pipeline.

# recurse=True removes the directory and everything under it
dbutils.fs.rm('/path', True)

anmol_deep
New Contributor III

I have tried that. It doesn't work and throws the error I mentioned above. (I did add recurse=True.)

Even if I try to delete using %sh rm -rf, the same error occurs. All files get deleted except this folder: _delta_log. If I try to delete it, it gives the error I mentioned above.

Hubert-Dudek
Esteemed Contributor III

Maybe that Delta table is still registered in the Hive metastore? Can you check your tables and run DROP TABLE?
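A quick sketch of that check, with hypothetical database and table names:

# List what is registered, then drop the stale entry
spark.sql("SHOW TABLES IN my_db").show()

# Dropping a managed table also deletes its files; for an external table
# only the metastore entry is removed
spark.sql("DROP TABLE IF EXISTS my_db.my_table")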

Hubert-Dudek
Esteemed Contributor III

Please also try rebooting the cluster and upgrading the runtime version.

Kaniz
Community Manager

Hi @Anmol Deep, did you try to follow @Hubert Dudek's suggestion? Did it help you resolve your problem?

anmol_deep
New Contributor III

Hi @Kaniz Fatma and @Hubert Dudek! Yes, I tried Hubert's suggestions, but they didn't work for me; I was still getting this error. I also tried the vacuum(0) command and then tried to delete, but even that didn't work. I kept getting the same error again and again.
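For context, the vacuum attempt described here would look roughly like the following (the path is a hypothetical stand-in). Note that VACUUM only removes data files and never deletes the _delta_log directory itself, which is consistent with the error persisting:

from delta.tables import DeltaTable

# VACUUM with 0 hours normally fails a safety check, which must be disabled first
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Hypothetical table path
dt = DeltaTable.forPath(spark, "abfss://container@account.dfs.core.windows.net/data/my_table")
dt.vacuum(0)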

anmol_deep
New Contributor III (Accepted Solution)

Hi @Kaniz Fatma! Please convey my request to the development team: make dbutils.fs commands faster. Implement multithreading/multiprocessing, as it seems dbutils.fs commands are single-threaded only. If this is not the best place to share feedback, let me know where I can do this.

Kaniz
Community Manager

Hi @Anmol Deep, thank you for taking the time to share your valuable suggestions. You can share your feedback here.
