Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading data in Azure Databricks Delta Lake from AWS Redshift

playermanny2
New Contributor II

We have Databricks set up and running on Azure. Now we want to connect it with Redshift (AWS) to perform further downstream analysis for our Redshift users.

I could find documentation on how to do this within the same cloud (either AWS or Azure), but not cross-cloud.

So I was wondering what the best approach would be to allow Redshift to read the Delta Lake hosted in Azure. I was hoping some sort of Glue catalog could be set up to allow reading from Redshift as an external table.

Highly appreciate the help.

2 REPLIES

Anonymous
Not applicable

@Manny Cato:

To allow Redshift to read data from Delta Lake hosted on Azure, you can use AWS Glue Data Catalog as an intermediary. The Glue Data Catalog is a fully managed metadata catalog that integrates with a variety of data sources, including Delta Lake and Redshift, to enable cross-cloud data integration.

Here are the high-level steps you can follow to set up this integration:

  1. Create an AWS Glue Data Catalog in your AWS account. This will serve as the metadata repository for your data.
  2. Set up a Glue Crawler to discover the schema and metadata for your Delta Lake table(s) hosted on Azure.
  3. Configure a Glue ETL job to extract the data from your Delta Lake table(s) and load it into a Redshift cluster.
  4. Define an external schema in Redshift that points to the Glue Data Catalog.
  5. Create external tables in Redshift that reference the data in the Glue Data Catalog.
  6. Query the data in Redshift as needed.
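Steps 4–6 above can be sketched in Redshift SQL. This is only an illustration: the external schema name, Glue database name, IAM role ARN, and table name below are hypothetical placeholders, and the IAM role must already be attached to the Redshift cluster with Glue and S3 permissions.

```sql
-- Step 4: define an external schema in Redshift backed by the Glue Data Catalog
-- (database name and IAM role ARN are placeholders for illustration)
CREATE EXTERNAL SCHEMA delta_ext
FROM DATA CATALOG
DATABASE 'my_glue_database'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftSpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Step 6: query the external table through Redshift Spectrum
SELECT *
FROM delta_ext.my_delta_table
LIMIT 10;
```

Note that when the external schema is created `FROM DATA CATALOG`, tables the Glue crawler registers in that database become visible automatically, so an explicit `CREATE EXTERNAL TABLE` (step 5) is often unnecessary.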

Note that there may be additional setup required for network connectivity between Azure and AWS, such as configuring VPC peering or VPN connections.

Overall, using the AWS Glue Data Catalog as an intermediary lets you integrate data across cloud environments while keeping control over your data and a consistent metadata repository.

Thank you -- would you happen to know the details on how to set up that crawler? There is an option for Delta Lake, but for the URL it asks for an S3 location. Would I just plug in an Azure Data Lake Storage location, and how would authentication work?
