<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic speed up a for loop in python (azure databrick) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26154#M18267</link>
    <description>&lt;P&gt;code example&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;# a list of file path&lt;/P&gt;&lt;P&gt;list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]&lt;/P&gt;&lt;P&gt;# copy all file above to this folder&lt;/P&gt;&lt;P&gt;dest_path=""/dbfs/mnt/..."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for file_path in list_files_path:&lt;/P&gt;&lt;P&gt;     # copy function&lt;/P&gt;&lt;P&gt;    copy_file(file_path, dest_path)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am running it in the azure databrick and it works fine. But I am wondering if I can utilize the power of parallel of cluster in the databrick.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I know that I can run the some kind of multi-threading in the master node but I am wondering if I can use pandas_udf to take advantage of work nodes as well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks! &lt;/P&gt;</description>
    <pubDate>Tue, 08 Mar 2022 22:55:26 GMT</pubDate>
    <dc:creator>Jackie</dc:creator>
    <dc:date>2022-03-08T22:55:26Z</dc:date>
    <item>
      <title>speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26154#M18267</link>
      <description>&lt;P&gt;code example&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;# a list of file path&lt;/P&gt;&lt;P&gt;list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]&lt;/P&gt;&lt;P&gt;# copy all file above to this folder&lt;/P&gt;&lt;P&gt;dest_path=""/dbfs/mnt/..."&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;for file_path in list_files_path:&lt;/P&gt;&lt;P&gt;     # copy function&lt;/P&gt;&lt;P&gt;    copy_file(file_path, dest_path)&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I am running it in the azure databrick and it works fine. But I am wondering if I can utilize the power of parallel of cluster in the databrick.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I know that I can run the some kind of multi-threading in the master node but I am wondering if I can use pandas_udf to take advantage of work nodes as well.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks! &lt;/P&gt;</description>
      <pubDate>Tue, 08 Mar 2022 22:55:26 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26154#M18267</guid>
      <dc:creator>Jackie</dc:creator>
      <dc:date>2022-03-08T22:55:26Z</dc:date>
    </item>
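A minimal sketch of the driver-side multi-threading the question mentions: file copies are I/O-bound, so a thread pool overlaps them. `parallel_copy` is a hypothetical name, `shutil.copy` stands in for the thread's `copy_file`, and local temp directories stand in for the elided `/dbfs/mnt/...` paths, which stay unspecified in the original.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def parallel_copy(list_files_path, dest_path, max_workers=8):
    """Copy every file in list_files_path into dest_path concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # one copy task per file; result() re-raises any copy error
        futures = [pool.submit(shutil.copy, f, dest_path) for f in list_files_path]
        return [f.result() for f in futures]

# demo with throwaway temp files instead of DBFS mounts
src_dir = Path(tempfile.mkdtemp())
dest_dir = Path(tempfile.mkdtemp())
files = []
for i in range(4):
    p = src_dir / f"part_{i}.txt"
    p.write_text(f"data {i}")
    files.append(str(p))

parallel_copy(files, str(dest_dir))
print(sorted(p.name for p in dest_dir.iterdir()))
```

On a real cluster this only parallelizes on the driver; spreading the copy over worker nodes needs a Spark-level mechanism such as the COPY INTO approach suggested in the replies below.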
    <item>
      <title>Re: speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26155#M18268</link>
      <description>&lt;P&gt;@Jackie Chan​&amp;nbsp;, To use spark parallelism you could register both destination as tables an use COPY INTO or register just source as table and use CREATE TABLE CLONE.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you want to use normal copy it is better to use dbutils.fs library&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you want to copy regularly data between ADSL/blobs nothing can catch up with Azure Data Factory. There you can make copy pipeline, it will be cheapest and fastest. If you need depedency to tun databricks notebook before/after copy you can orchestrate it there (on successful run databricks notebook etc.) as databricks is integrated with ADF.&lt;/P&gt;</description>
      <pubDate>Wed, 09 Mar 2022 10:55:38 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26155#M18268</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-03-09T10:55:38Z</dc:date>
    </item>
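A hedged sketch of the COPY INTO route from the reply above: the snippet only builds the SQL statement (the table name and storage path are hypothetical placeholders, not from the thread); on a real cluster you would pass the result to `spark.sql(...)`, which distributes the load across the worker nodes.

```python
def build_copy_into(dest_table: str, source_path: str, file_format: str = "PARQUET") -> str:
    """Assemble a Databricks COPY INTO statement for loading files into a table."""
    return (
        f"COPY INTO {dest_table}\n"
        f"FROM '{source_path}'\n"
        f"FILEFORMAT = {file_format}"
    )

# hypothetical table and ADLS path for illustration only
sql = build_copy_into("my_dest_table", "abfss://container@account.dfs.core.windows.net/src/")
print(sql)
```

COPY INTO is also idempotent, so re-running the same statement skips files that were already loaded.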
    <item>
      <title>Re: speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26156#M18269</link>
      <description>&lt;P&gt;@Jackie Chan​&amp;nbsp;, Indeed ADF has massive throughput. So go for ADF if you want a plain copy (so no transformations).&lt;/P&gt;</description>
      <pubDate>Thu, 10 Mar 2022 07:04:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26156#M18269</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-10T07:04:52Z</dc:date>
    </item>
    <item>
      <title>Re: speed up a for loop in python (azure databrick)</title>
      <link>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26158#M18271</link>
      <description>&lt;P&gt;@Jackie Chan​&amp;nbsp;, What's the data size you want to copy? If it's bigger, then use ADF.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Apr 2022 02:07:59 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/speed-up-a-for-loop-in-python-azure-databrick/m-p/26158#M18271</guid>
      <dc:creator>Hemant</dc:creator>
      <dc:date>2022-04-28T02:07:59Z</dc:date>
    </item>
  </channel>
</rss>

