Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading data in Azure Databricks Delta Lake from AWS Redshift

playermanny2
New Contributor II

We have Databricks set up and running on Azure. Now we want to connect it with Redshift (AWS) to perform further downstream analysis for our Redshift users.

I could find documentation on how to do this within the same cloud (either AWS or Azure), but not cross-cloud.

So I was wondering what the best approach would be to allow Redshift to read the Delta Lake hosted in Azure. I was hoping some sort of Glue catalog could be set up to allow reading from Redshift as an external table.

Highly appreciate the help.

2 REPLIES

Anonymous
Not applicable

@Manny Cato:

To allow Redshift to read data from Delta Lake hosted on Azure, you can use AWS Glue Data Catalog as an intermediary. The Glue Data Catalog is a fully managed metadata catalog that integrates with a variety of data sources, including Delta Lake and Redshift, to enable cross-cloud data integration.

Here are the high-level steps you can follow to set up this integration:

  1. Create an AWS Glue Data Catalog in your AWS account. This will serve as the metadata repository for your data.
  2. Set up a Glue Crawler to discover the schema and metadata for your Delta Lake table(s) hosted on Azure.
  3. Configure a Glue ETL job to extract the data from your Delta Lake table(s) and load it into a Redshift cluster.
  4. Define an external schema in Redshift that points to the Glue Data Catalog.
  5. Create external tables in Redshift that reference the data in the Glue Data Catalog.
  6. Query the data in Redshift as needed.
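Steps 4–6 above can be sketched in Redshift SQL. This is only an illustration: the external schema name, Glue database name, IAM role ARN, and table name below are hypothetical placeholders, and the IAM role must already be attached to the Redshift cluster with Glue and S3 permissions.

```sql
-- Step 4: define an external schema in Redshift backed by the Glue Data Catalog
-- (database name and IAM role ARN are placeholders for illustration)
CREATE EXTERNAL SCHEMA delta_ext
FROM DATA CATALOG
DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Step 6: query the external table through Redshift Spectrum
SELECT *
FROM delta_ext.my_delta_table
LIMIT 10;
```

Note that when the external schema is created `FROM DATA CATALOG`, tables the Glue crawler registers in that database become visible automatically, so an explicit `CREATE EXTERNAL TABLE` (step 5) is often unnecessary.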

Note that there may be additional setup required for network connectivity between Azure and AWS, such as configuring VPC peering or VPN connections.

Overall, using the AWS Glue Data Catalog as an intermediary lets you integrate data across cloud environments while keeping control over your data and a consistent metadata repository.

Thank you -- would you happen to know the details on how to set up that crawler? There is an option for Delta Lake, but for the URL it asks for an S3 location. Would I just plug in an Azure Data Lake Storage location, and how would authentication work?
