Speed up a for loop in Python (Azure Databricks)

Jackie
New Contributor II

Code example:

# a list of file paths
list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]

# copy all files above to this folder
dest_path = "/dbfs/mnt/..."

for file_path in list_files_path:
    # copy function
    copy_file(file_path, dest_path)

I am running this in Azure Databricks and it works fine, but I am wondering if I can utilize the parallelism of the cluster in Databricks.

I know that I can run some kind of multi-threading on the driver (master) node, but I am wondering if I can use pandas_udf to take advantage of the worker nodes as well.
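
For example, this is the kind of driver-side threading I have in mind (a minimal sketch; copy_file, list_files_path, and dest_path are the same placeholders as in the code above):

from concurrent.futures import ThreadPoolExecutor

# run the copies concurrently on the driver; threads overlap the I/O waits
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() forces completion and surfaces any exception from a failed copy
    results = list(pool.map(lambda path: copy_file(path, dest_path), list_files_path))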

Thanks!

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

@Jackie Chan, to use Spark parallelism you could register the destination as a table and use COPY INTO, or register just the source as a table and use CREATE TABLE CLONE.
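
A hedged sketch of both options (the table names, source path, and file format below are hypothetical placeholders):

# Option 1: the destination is registered as a Delta table;
# COPY INTO loads the files from the source path in parallel across the cluster
spark.sql("""
    COPY INTO my_dest_table
    FROM 'dbfs:/mnt/source_folder'
    FILEFORMAT = PARQUET
""")

# Option 2: the source is registered as a (Delta) table;
# DEEP CLONE copies its data into a new table, also in parallel
spark.sql("CREATE TABLE my_dest_table DEEP CLONE my_source_table")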

If you want to do a plain copy, it is better to use the dbutils.fs library.
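
For example (the mount paths are placeholders; note that dbutils.fs takes dbfs:/... URIs rather than the local /dbfs/... FUSE paths used in the question):

# copy a single file, then a whole folder (recurse=True copies the directory tree)
dbutils.fs.cp("dbfs:/mnt/source_folder/file.csv", "dbfs:/mnt/dest_folder/file.csv")
dbutils.fs.cp("dbfs:/mnt/source_folder", "dbfs:/mnt/dest_folder", recurse=True)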

If you want to copy data regularly between ADLS/Blob Storage, nothing can catch up with Azure Data Factory. There you can build a copy pipeline; it will be the cheapest and fastest option. If you need a dependency to run a Databricks notebook before/after the copy, you can orchestrate it there (on successful run, trigger the Databricks notebook, etc.), as Databricks is integrated with ADF.


4 REPLIES


-werners-
Esteemed Contributor III

@Jackie Chan, indeed, ADF has massive throughput, so go for ADF if you want a plain copy (no transformations).

Kaniz
Community Manager

Hi @Jackie Chan, just a friendly follow-up: do you still need help, or did the above responses help you find a solution? Please let us know.

Hemant
Valued Contributor II

@Jackie Chan, what's the size of the data you want to copy? If it's large, then use ADF.

Hemant Soni