Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

copy files from azure to s3

chanansh
Contributor

I am trying to copy files from Azure to S3. I've built a solution that compares file lists, copies each file manually to a temp location, and uploads it. However, I just found Auto Loader and would like to use it instead: https://docs.databricks.com/ingestion/auto-loader/index.html

The problem is that the documentation does not make clear how to pass the Azure Blob Storage credentials (tenant_id, container, account_url, client_id, client_secret) and the azure_path to the stream reader.

What is the API to do that?
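A minimal sketch of the usual pattern, assuming the source is ADLS Gen2 reached over abfss:// with a service principal: the credentials are not passed to the stream reader itself, but set as Hadoop ABFS OAuth configuration on the cluster or Spark session, which Auto Loader then picks up when it reads the path. Every account, secret scope, and key name below is a placeholder.

# Set ABFS OAuth configs on the Spark session so any reader (including
# Auto Loader) can authenticate to the storage account with a service principal.
storage_account = "<storage-account>"                               # placeholder
tenant_id     = dbutils.secrets.get("my-scope", "tenant-id")        # hypothetical secret scope/keys
client_id     = dbutils.secrets.get("my-scope", "client-id")
client_secret = dbutils.secrets.get("my-scope", "client-secret")

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")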

9 REPLIES

Hubert-Dudek
Esteemed Contributor III

Copying files using Azure Data Factory can be cheaper and faster.

If you want access to Blob Storage / Azure Data Lake Storage, you can also create a permanent mount in Databricks. I described how to do it here: https://community.databricks.com/s/feed/0D53f00001eQG.OHCA4
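A rough sketch of such a mount, assuming ADLS Gen2 and a service principal; the secret scope, key names, container, account, and mount point are all placeholders:

# Mount an ADLS Gen2 container under /mnt using service-principal OAuth.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("my-scope", "client-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("my-scope", "client-secret"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/azure-source",
    extra_configs=configs,
)

Once mounted, /mnt/azure-source can be used like any other DBFS path.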

BigMF
New Contributor III

Agreed, Azure Data Factory is definitely a better approach if all you want to do is copy files to/from Azure Storage.

I need it to keep working continuously because the data updates all the time. In any case, I only have read permissions for the Azure blob.

BigMF
New Contributor III

ADF can be scheduled to run as often as needed, or triggered when files show up in a container. However, based on your other comment below, it sounds like you are not working in an Azure environment and only have access to the storage container. I suppose you could use Databricks to copy the files, but it seems wasteful; it's like having a metal toolbox full of tools that are great for specific jobs and using the box itself to hammer in a nail.

Auto Loader is the solution for me, but I don't know how to set the credentials.

I am not an Azure user. I only have read permissions on the blob.

-werners-
Esteemed Contributor III

You can also use AWS Data Pipeline.

From what I have read, this is a plain copy with no transformations.

In that case, firing up a Spark cluster is way too much overhead and way too expensive.

If you lack permissions to connect to the Azure blob, I would try to fix that rather than look for a workaround using Databricks.

chanansh
Contributor

I want to use Auto Loader. I just need to know how to pass credentials to the stream reader.
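A minimal sketch of what that could look like once the ABFS OAuth configuration above is in place, assuming ADLS Gen2 and the cloudFiles source. The file format, paths, and checkpoint location are placeholders, and note that this ingests the source files into a table stored on S3 rather than producing a byte-for-byte file copy:

# Auto Loader reads new files from the Azure path; credentials come from the
# fs.azure.* Spark configs, not from options on the stream reader itself.
df = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")   # placeholder: pick the real source format
        .load("abfss://<container>@<storage-account>.dfs.core.windows.net/<path>"))

# Write incrementally to S3 (Delta by default on Databricks); the checkpoint
# tracks which source files have already been processed.
(df.writeStream
   .option("checkpointLocation", "s3://<bucket>/checkpoints/azure-copy/")
   .trigger(availableNow=True)     # newer runtimes: process what's available, then stop
   .start("s3://<bucket>/target/azure-copy/"))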

Falokun
New Contributor II

Just use tools like GoodSync or GS RichCopy 360 to copy directly from Blob to S3; I think you will never face problems like that.
