Copy files from Azure to S3
01-11-2023 09:46 AM
I am trying to copy files from Azure to S3. I've built a solution that compares file lists, copies each file manually to a temp location, and uploads it. However, I just found Auto Loader and would like to use that instead: https://docs.databricks.com/ingestion/auto-loader/index.html
The problem is that the documentation doesn't make clear how to pass the Azure Blob Storage credentials (tenant_id, container, account_url, client_id, client_secret and the azure_path) to the stream reader.
What is the API to do that?
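Roughly what I have so far, based on the Auto Loader docs; the container, account, and path below are just placeholders, and the credential part is what I can't figure out:

```python
# Sketch based on the Auto Loader docs; <container>, <account> and <azure_path> are placeholders.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")   # read the files as-is, no parsing
      .load("abfss://<container>@<account>.dfs.core.windows.net/<azure_path>"))

# Where do tenant_id, client_id and client_secret go?
```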
Labels:
- Autoloader
- Azure
- Client Secret
- Copy
01-11-2023 10:04 AM
Copying the files with Azure Data Factory can be cheaper and faster.
If you need access to Blob Storage / Azure Data Lake Storage, you can also create a permanent mount in Databricks. I described how to do that here: https://community.databricks.com/s/feed/0D53f00001eQG.OHCA4
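A minimal sketch of such a mount, assuming a service principal and a Databricks secret scope; every name below is a placeholder:

```python
# Placeholder names: replace the scope, secret key, IDs, container and account with your own.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("<scope>", "<client-secret-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Mount the container once; it then stays available at /mnt/azure-source across the workspace.
dbutils.fs.mount(
    source="abfss://<container>@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/azure-source",
    extra_configs=configs,
)
```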
01-11-2023 11:52 AM
Agreed, Azure Data Factory is definitely a better approach if all you want to do is copy files to/from Azure Storage.
01-11-2023 11:56 AM
I need it to stay up to date all the time, so it has to keep running continuously. In any case, I only have read permissions on the Azure blob.
01-11-2023 06:18 PM
ADF can be scheduled to run as often as needed, or triggered when files show up in a container. However, based on your other comment below, it sounds like you are not working in an Azure environment and only have access to the storage container. I guess you could use Databricks to copy the files, but it seems wasteful; the analogy I would use is owning a metal toolbox full of tools that are great for specific jobs and using the box itself to hammer in a nail.
01-11-2023 09:58 PM
Auto Loader is the solution for me, but I don't know how to set the credentials.
01-11-2023 11:55 AM
I am not an Azure user; I only have read permissions on the blob.
01-12-2023 02:35 AM
You can also use AWS Data Pipeline.
From what I have read, this is a plain copy with no transformations.
In that case, firing up a Spark cluster is way too much overhead and way too expensive.
If you lack permissions to connect to the Azure blob, I would try to fix that rather than work around it with Databricks.
01-15-2023 03:36 AM
I want to use Auto Loader. I just need to know how to pass the credentials to the stream reader.
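From what I can piece together, the credentials are not options on the stream reader itself; they are set as Spark configurations keyed by the storage account, and the reader then just loads the abfss:// path. A rough sketch, assuming a service-principal (OAuth) setup against ADLS Gen2; the account name, secret scope, container and path are placeholders:

```python
# Placeholder values; replace with your own service principal details.
account       = "<storage-account-name>"
tenant_id     = "<tenant-id>"
client_id     = "<client-id>"
client_secret = dbutils.secrets.get("<scope>", "<client-secret-key>")  # assumed secret scope

# The credentials live in per-account Spark configs, not in readStream options.
spark.conf.set(f"fs.azure.account.auth.type.{account}.dfs.core.windows.net", "OAuth")
spark.conf.set(f"fs.azure.account.oauth.provider.type.{account}.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set(f"fs.azure.account.oauth2.client.id.{account}.dfs.core.windows.net", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{account}.dfs.core.windows.net", client_secret)
spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{account}.dfs.core.windows.net",
               f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")

# Auto Loader then reads the abfss:// path like any other source.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "binaryFile")
      .load(f"abfss://<container>@{account}.dfs.core.windows.net/<azure_path>"))
```

Writing the discovered files out to S3 would still be a separate step (for example a foreachBatch sink that copies each new file), and the cluster needs its own AWS credentials or instance profile for the target bucket.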
01-20-2023 05:06 AM
Just use a tool like Goodsync or Gs Richcopy 360 to copy directly from Blob Storage to S3; I don't think you will run into problems like this that way.

