Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Speed up a for loop in Python (Azure Databricks)

Jackie
New Contributor II

Code example:

# a list of file paths (elided here)
list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]

# copy all files above into this destination folder
dest_path = "/dbfs/mnt/..."

for file_path in list_files_path:
    # copy function
    copy_file(file_path, dest_path)

I am running this in Azure Databricks and it works fine, but I am wondering if I can utilize the parallelism of the Databricks cluster.

I know that I can run some kind of multi-threading on the driver (master) node, but I am wondering if I can use pandas_udf to take advantage of the worker nodes as well. For example, something like the rough sketch below, which only parallelizes on the driver (copy_file and the paths are placeholders for my own helper and files):
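from concurrent.futures import ThreadPoolExecutor

# run the copies concurrently, but still only on the driver node
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(copy_file, p, dest_path) for p in list_files_path]
    for f in futures:
        f.result()  # re-raise any copy error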

Thanks!

1 ACCEPTED SOLUTION


Hubert-Dudek
Esteemed Contributor III

@Jackie Chan​ , to use Spark parallelism you could register both source and destination as tables and use COPY INTO, or register just the source as a table and use CREATE TABLE CLONE. For example, a rough sketch from a notebook (table and path names below are placeholders you would replace with your own):
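# COPY INTO loads files from a source location into an existing Delta table
spark.sql("""
  COPY INTO my_dest_table
  FROM '/mnt/source_folder'
  FILEFORMAT = PARQUET
""")

# CREATE TABLE ... CLONE copies an existing Delta table in a single statement
spark.sql("CREATE TABLE my_clone_table DEEP CLONE my_source_table")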

If you just want a plain file copy, it is better to use the dbutils.fs library. A sketch (note that dbutils.fs takes dbfs:/ style paths rather than the /dbfs/ local mount paths, so the example paths below are placeholders):
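dest_path = "dbfs:/mnt/target_folder/"

# copy each file with the Databricks filesystem utilities
for file_path in list_files_path:
    dbutils.fs.cp(file_path, dest_path)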

If you want to copy data regularly between ADLS/blob storage, nothing beats Azure Data Factory. There you can build a copy pipeline; it will be the cheapest and fastest option. If you need a dependency to run a Databricks notebook before/after the copy, you can orchestrate it there (on successful run, trigger the Databricks notebook, etc.), as Databricks is integrated with ADF.


3 REPLIES


-werners-
Esteemed Contributor III

@Jackie Chan​ , indeed, ADF has massive throughput, so go for ADF if you want a plain copy (no transformations).

Hemant
Valued Contributor II

@Jackie Chan​ , what's the size of the data you want to copy? If it's large, then use ADF.

Hemant Soni
