In the modern data landscape, data is often scattered across various specialized databases, cloud warehouses, and legacy systems. Traditionally, unifying this data required complex Extract, Transform, Load (ETL) pipelines to move everything into a single central repository. Lakehouse Federation in Databricks changes this paradigm, allowing you to query external data sources directly from your Databricks workspace without moving the data first.
Lakehouse Federation is a query federation platform integrated within Databricks Unity Catalog. It enables users to run queries against multiple external data sources as if they were local tables in the Databricks Lakehouse. This approach minimizes data movement, maintains live access to operational systems, and simplifies data governance.
There are two primary ways Databricks handles this: Query Federation and Catalog Federation.
Query Federation allows you to run SQL queries against external databases without migrating the data into Databricks. When a query is executed, Databricks “pushes down” parts of the query to the foreign database using JDBC. This means the work is distributed between Databricks compute and the remote database’s compute engine.
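Once a foreign catalog is configured (setup is covered below), querying it looks just like querying a local table. In this illustrative sketch, `pg_sales` is a placeholder name for a foreign catalog backed by a PostgreSQL database; a filter like the `WHERE` clause here is a candidate for pushdown to the remote engine:

```sql
-- `pg_sales` is a hypothetical foreign catalog; the three-level
-- catalog.schema.table naming is standard Unity Catalog addressing.
SELECT customer_id, SUM(amount) AS total_spend
FROM pg_sales.public.orders
WHERE order_date >= '2024-01-01'   -- eligible for pushdown to the remote DB
GROUP BY customer_id;
```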
Query Federation is the “fast track” to data insights, particularly in scenarios where full data ingestion isn’t practical.
Note that Databricks offers two complementary ways to work with external data: Query Federation for live, in-place access, and Lakeflow Connect for ingesting the data into the Lakehouse (more on that trade-off below).
Query Federation supports a massive variety of popular data sources, including:
Setting up federation is designed to be straightforward within the Unity Catalog framework:
If you use the API to create a connection to the data source, you will need to create the foreign catalog separately.
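The two setup steps can be sketched in SQL. This is an illustrative example assuming a MySQL source; the host, secret scope, and catalog names are placeholders you would replace with your own:

```sql
-- Step 1: create a connection object in Unity Catalog.
-- Credentials are pulled from a secret scope rather than hard-coded.
CREATE CONNECTION mysql_conn TYPE mysql
OPTIONS (
  host 'mysql.example.com',
  port '3306',
  user secret('my_scope', 'mysql_user'),
  password secret('my_scope', 'mysql_password')
);

-- Step 2: mirror a remote database as a foreign catalog,
-- making its tables queryable through Unity Catalog.
CREATE FOREIGN CATALOG mysql_catalog
USING CONNECTION mysql_conn
OPTIONS (database 'sales_db');
```

After this, `mysql_catalog.sales_db.<table>` is addressable from any Unity Catalog-enabled compute, with permissions governed centrally in Unity Catalog.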
Unlike standard query federation that pushes queries to a remote database engine, Catalog Federation allows Unity Catalog to directly access foreign tables in object storage (S3, ADLS, GCS). By bypassing the “middle-man” compute of a remote database, your queries run entirely on Databricks compute, making them significantly faster and more cost-effective.
One standout feature of Hive metastore federation is Authorized Paths. To prevent users from maliciously redirecting tables to sensitive data locations through an unsecured Hive metastore, admins define specific “authorized” cloud storage sub-paths. This ensures that even if a metadata entry is changed, Unity Catalog won’t allow access unless the path falls under a pre-approved prefix.
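A catalog federation setup for a legacy Hive metastore might look like the sketch below. The connection type and the `authorized_paths` option follow the documented pattern, but treat the exact option names as an assumption to verify against your workspace; the bucket path is a placeholder:

```sql
-- Sketch: federate the workspace's internal (legacy) Hive metastore.
CREATE CONNECTION hms_conn TYPE hive_metastore
OPTIONS (builtin true);

-- Only tables whose storage location falls under the approved prefix
-- will be accessible, even if HMS metadata is tampered with.
CREATE FOREIGN CATALOG hms_federated
USING CONNECTION hms_conn
OPTIONS (authorized_paths 's3://my-bucket/approved/');
```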
Understanding the difference between these two types of federation is key to choosing the right tool for your use case.
| Feature | Query Federation | Catalog Federation |
| --- | --- | --- |
| Connection Method | JDBC (standard database connection) | Direct object storage (S3, GCS, ADLS) |
| Source Type | Databases & warehouses | Lake metadata systems |
| Query Type | Pushed down to the foreign database via JDBC | Direct access to foreign tables via object storage |
| Compute Engine | Both Databricks and the remote DB engine | Databricks compute only |
| Performance | Depends on the remote source’s ability to handle the pushed-down query | Optimized and cost-effective thanks to direct storage access |
| Cost | Can be higher (remote compute costs) | Lower (uses native Databricks compute) |
| Data Format | Any (MySQL, SQL Server, Redshift, etc.) | Open formats (Parquet, Delta, Iceberg) |
| Best For | Ad-hoc reporting, proofs of concept (PoCs), live access to operational databases | Phased migrations to Unity Catalog, or maintaining a long-term hybrid model with external catalogs |
Lakehouse Federation is not just about convenience; it’s a strategic tool for accelerating time-to-insight. By removing the requirement for immediate ETL, data teams can provide business users with access to new data sources in minutes rather than weeks.
For high-volume, low-latency requirements, Databricks recommends Lakeflow Connect as an alternative to federation if the performance of live JDBC connections becomes a bottleneck.
In short, federation is excellent for:
But for:
you should ingest the data into Delta Lake and optimize it there.
Think of federation as a bridge, not always the final destination.
For a deeper technical dive, check out the official Databricks Lakehouse Federation Docs.