Speed up a for loop in Python (Azure Databricks)

Jackie
New Contributor II

Code example:

# a list of file paths
list_files_path = ["/dbfs/mnt/...", ..., "/dbfs/mnt/..."]

# copy all files above to this folder
dest_path = "/dbfs/mnt/..."

for file_path in list_files_path:
    # copy function
    copy_file(file_path, dest_path)

I am running this in Azure Databricks and it works fine, but I am wondering if I can utilize the parallelism of the cluster in Databricks.

I know that I can run some kind of multi-threading on the driver (master) node, but I am wondering if I can use pandas_udf to take advantage of the worker nodes as well.
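
For example, this is the kind of driver-side threading I have in mind (a minimal sketch; copy_file, list_files_path, and dest_path are the same placeholders as in the code above):

from concurrent.futures import ThreadPoolExecutor

# run the copies concurrently on the driver; threads overlap the I/O waits
with ThreadPoolExecutor(max_workers=8) as pool:
    # list() forces completion and surfaces any exception from a failed copy
    results = list(pool.map(lambda path: copy_file(path, dest_path), list_files_path))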

Thanks!

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

@Jackie Chan, to use Spark parallelism you could register the destination as a table and use COPY INTO, or register just the source as a table and use CREATE TABLE CLONE.
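
A hedged sketch of both options (the table names, source path, and file format below are hypothetical placeholders):

# Option 1: the destination is registered as a Delta table;
# COPY INTO loads the files from the source path in parallel across the cluster
spark.sql("""
    COPY INTO my_dest_table
    FROM 'dbfs:/mnt/source_folder'
    FILEFORMAT = PARQUET
""")

# Option 2: the source is registered as a (Delta) table;
# DEEP CLONE copies its data into a new table, also in parallel
spark.sql("CREATE TABLE my_dest_table DEEP CLONE my_source_table")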

If you want to do a plain copy, it is better to use the dbutils.fs library.
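
For example (the mount paths are placeholders; note that dbutils.fs takes dbfs:/... URIs rather than the local /dbfs/... FUSE paths used in the question):

# copy a single file, then a whole folder (recurse=True copies the directory tree)
dbutils.fs.cp("dbfs:/mnt/source_folder/file.csv", "dbfs:/mnt/dest_folder/file.csv")
dbutils.fs.cp("dbfs:/mnt/source_folder", "dbfs:/mnt/dest_folder", recurse=True)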

If you want to copy data regularly between ADLS/Blob Storage, nothing can catch up with Azure Data Factory. There you can build a copy pipeline; it will be the cheapest and fastest option. If you need a dependency to run a Databricks notebook before/after the copy, you can orchestrate it there (on successful run, trigger the Databricks notebook, etc.), as Databricks is integrated with ADF.


4 REPLIES


-werners-
Esteemed Contributor III

@Jackie Chan, indeed, ADF has massive throughput, so go for ADF if you want a plain copy (no transformations).

Kaniz
Community Manager

Hi @Jackie Chan, just a friendly follow-up: do you still need help, or did the above responses help you find a solution? Please let us know.

Hemant
Valued Contributor II

@Jackie Chan, what's the size of the data you want to copy? If it's large, then use ADF.

Hemant Soni