mido1978
Databricks Employee

Introduction

Announced at the Data + AI Summit in June 2023, Lakehouse Federation is a groundbreaking new Databricks capability that allows you to query data across external data sources - including Snowflake, Synapse and many others, and even Databricks itself - without having to move or copy the data. It is built on Databricks’ Unity Catalog, which provides a unified metadata layer for all of your data.

Lakehouse Federation is a game-changer for data teams, as it breaks down the silos that have traditionally kept data locked away in different systems. With Lakehouse Federation, you can finally access all of your data in one place, making it easier to get the insights you need to make better business decisions.

As always, though, no single solution is a silver bullet for your data integration and querying needs. See below for when Federation is a good fit, and for when you’d prefer to bring your data into your solution and process it as part of your lakehouse platform pipelines.

A few of the benefits of using Lakehouse Federation in Databricks are:

  • Improved data access and discovery: Lakehouse Federation makes it easy to find and access the data you need from your database estate. This is especially important for organizations with complex data landscapes.
  • Reduced data silos: Lakehouse Federation can help to break down data silos by providing a unified view of all data across the organization.
  • Improved data governance: Lakehouse Federation can help to improve data governance by providing a single place to manage permissions and access to data from within Databricks.
  • Reduced costs: Lakehouse Federation can help to reduce costs by eliminating the need to move or copy data between different data sources.

If you are looking for a way to improve the way you access and manage your data across your analytics estate, then Lakehouse Federation in Databricks is a top choice. 


Reality Check

Whilst Lakehouse Federation is a powerful tool, it is not a good fit for every use case. Here are some specific scenarios where Lakehouse Federation is not a good choice:

  • Real-time data processing: Lakehouse Federation queries can be slower than queries on data that is stored locally in the lake, so it is not a good choice for applications that require real-time data processing.
  • Complex data transformations: Where you need complex data transformations and processing, or need to ingest and transform vast amounts of data. For the large majority of use cases, you will need to apply some kind of ETL/ELT process to make your data fit for consumption. In these scenarios, it is still best to take a medallion-style approach: bring the data in, process and clean it, then model and serve it so it is performant and fit for consumption by end users.

Therefore, whilst Lakehouse Federation is a great option for certain use cases as highlighted above, it’s not a silver bullet for all scenarios. Consider it an augmentation of your analytics capability that allows for additional use cases that need agility and direct source access for creating a holistic view of your data estate, all controlled through one governance layer.

Setting Up Your First Federated Lakehouse

With that in mind, let’s get started on setting up your first federated Lakehouse in Databricks using Lakehouse Federation.

For this example, we will be using a familiar sample database - Adventure Works - running on an Azure SQL Database. We will be walking you through how to set up your connection to Azure SQL and how to add it as a foreign catalog inside Databricks.

Prerequisites

To set up Lakehouse Federation in Databricks, you will need the following prerequisites:

  • A Databricks workspace enabled for Unity Catalog
  • Compute that supports federation queries: a Pro or Serverless SQL warehouse, or a cluster on a recent Databricks Runtime (13.1 or above at the time of writing)
  • The CREATE CONNECTION privilege on the metastore, and the CREATE CATALOG privilege to create the foreign catalog
  • Network connectivity from Databricks to the external data source, plus valid credentials for it (here, an Azure SQL Database)

Setup

Setting up federation is essentially a three-step process:

  • Set up a connection
  • Set up a foreign catalog
  • Query your data sources

Setting Up A Connection

We are going to use an Azure SQL Database as the test data source, with the AdventureWorksLT sample database already installed and ready to query:

[Screenshot: an example query run against the source AdventureWorksLT database]
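
If you want to reproduce that check yourself, a simple query like the one below - run directly against the Azure SQL instance, for example in the Azure portal query editor or SSMS - confirms the sample data is in place. The schema and table names are the standard AdventureWorksLT ones.

  -- Run against the source Azure SQL Database to confirm the
  -- AdventureWorksLT sample schema is installed and queryable
  SELECT TOP 10
      ProductID,
      Name,
      ListPrice
  FROM SalesLT.Product
  ORDER BY ListPrice DESC;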

We want to add this database as a foreign catalog in Databricks so we can query it alongside other data sources. To connect to the database, we need a username, password and hostname, obtained from our Azure SQL instance.

With these details ready, we can now go into Databricks and add the connection there as our first step.

First, expand the Catalog view, go to Connections and click “Create Connection”:

[Screenshot: the Connections page in the Catalog view, with the “Create Connection” button]

To add your new connection, give it a name, choose your connection type and then add the relevant login details for that data source:

[Screenshot: the Create Connection dialog with the connection name, type and login details]
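
The same step can also be done in SQL rather than through the UI. The sketch below assumes an Azure SQL source and credentials stored in a Databricks secret scope called sql_creds; the hostname, connection name and secret keys are placeholders for your own values.

  -- Create the connection in SQL instead of the UI.
  -- 'sql_creds' and its keys are hypothetical; substitute your own secrets.
  CREATE CONNECTION azure_sql_adventureworks TYPE sqlserver
  OPTIONS (
    host 'myserver.database.windows.net',  -- your Azure SQL hostname
    port '1433',
    user secret('sql_creds', 'sql_user'),
    password secret('sql_creds', 'sql_password')
  );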

 

Create a Foreign Catalog

Test your connection and verify all is well. From there, go back to the Catalog view and click Create Catalog:

[Screenshot: Create Catalog in the Catalog view]


From there, populate the relevant details (setting Type to “Foreign”), choosing the connection you created in the first step and specifying the database you want to add as a foreign catalog:

[Screenshot: the Create Catalog dialog with Type set to Foreign, the connection, and the source database]
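
Again, this step can equally be done with a single SQL statement. The sketch below assumes the connection name from the previous step; the catalog name is up to you.

  -- Create a foreign catalog that exposes the AdventureWorksLT database
  -- through the connection defined above
  CREATE FOREIGN CATALOG adventureworks
  USING CONNECTION azure_sql_adventureworks
  OPTIONS (database 'AdventureWorksLT');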


Once added, you have the option of granting the relevant user permissions on the objects here, all governed by Unity Catalog (we skip this in this article as there are no other users of this database):

[Screenshot: the Permissions tab for the new foreign catalog in Unity Catalog]
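
If you did want to open the catalog up to other users, a grant along these lines would do it; the analysts group here is just a placeholder for a group in your own workspace.

  -- Hypothetical grant: let the 'analysts' group browse and read the foreign catalog
  GRANT USE CATALOG, USE SCHEMA, SELECT
  ON CATALOG adventureworks
  TO `analysts`;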


Our foreign catalog can now be queried just like any other catalog inside Databricks, bringing our broader data estate into our lakehouse:

[Screenshot: the new foreign catalog listed in the Catalog Explorer]

Querying the Federated Data

We can now access our federated Azure SQL Database as normal, straight from our Databricks SQL Warehouse:

[Screenshot: browsing the federated Azure SQL Database from the SQL editor]

And query it as we would any other object:

[Screenshot: a SELECT query against a federated table]
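
For example, a query along these lines works against the federated catalog exactly as it would against a local one (the schema and table names follow the standard AdventureWorksLT layout):

  -- Query the federated Azure SQL table through the foreign catalog
  SELECT ProductID, Name, ListPrice
  FROM adventureworks.saleslt.product
  ORDER BY ListPrice DESC
  LIMIT 10;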


Or even join it to a local delta table inside our Unity Catalog:

[Screenshot: a join between a federated table and a local Delta table]
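
As a rough sketch of what such a join might look like (the local table main.sales.product_ratings is made up for illustration; substitute any Delta table in your own Unity Catalog):

  -- Join a federated SQL Server table with a local Delta table in Unity Catalog.
  -- 'main.sales.product_ratings' is a hypothetical local table.
  SELECT p.ProductID,
         p.Name,
         r.avg_rating
  FROM adventureworks.saleslt.product AS p
  JOIN main.sales.product_ratings AS r
    ON p.ProductID = r.product_id;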

Conclusion

What we’ve shown here is just scratching the surface of what Lakehouse Federation can do with a simple connection and query. By leveraging this offering, combined with the governance and capabilities of Unity Catalog, you can extend the range of your lakehouse estate, ensuring consistent permissions and controls across all of your data sources and thus enabling a plethora of new use cases and opportunities.

Further Reading

Setup Lakehouse Federation

Introducing Lakehouse Federation



3 Comments
Guillaume_B
New Contributor

Hi @mido1978 , how can we setup Lakehouse Federation with Microsoft Fabric/OneLake from Databricks ? From Fabric side we can mirror Unity Catalog but I didn't find the way to achieve the other way around (mirror Fabric/OneLake Catalog in Unity Catalog). Thanks

mido1978
Databricks Employee

Hi @Guillaume_B 
Unfortunately, at the time of writing there's no equivalent to mirroring "back" from OneLake into UC. There are some integration pieces that need doing to make it work, and this is still being looked at.

Currently the best method is to copy the data out of OneLake into ADLS using something like ADF, then use external tables or similar to bring it into UC. ADLS passthrough also works, but isn't recommended or supported by Databricks due to the need to bypass UC credentials.
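
For example, once the copied Delta files are sitting in an ADLS path covered by a UC external location, registering them could look something like this (the path and table name below are placeholders):

  -- Register the copied Delta data as an external table in Unity Catalog.
  -- The ADLS path and table name are placeholders for your own environment.
  CREATE TABLE main.bronze.onelake_sales_copy
  USING DELTA
  LOCATION 'abfss://landing@mystorageaccount.dfs.core.windows.net/onelake_copies/sales';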

Guillaume_B
New Contributor

Thanks @mido1978 for your reply. So if I understand correctly, it's being looked at by the Databricks development team, and that's good news! (For now, yes, we are already copying the Delta tables out of OneLake to Gen2, which obviously defeats the purpose of the Lakehouse Federation concept 😛)