In the modern data landscape, data is often scattered across various specialized databases, cloud warehouses, and legacy systems. Traditionally, unifying this data required complex Extract, Transform, Load (ETL) pipelines to move everything into a single central repository. Lakehouse Federation in Databricks changes this paradigm, allowing you to query external data sources directly from your Databricks workspace without moving the data first.
Lakehouse Federation is a query federation platform integrated within Databricks Unity Catalog. It enables users to run queries against multiple external data sources as if they were local tables in the Databricks Lakehouse. This approach minimizes data movement, maintains live access to operational systems, and simplifies data governance.
There are two primary ways Databricks handles this: Query Federation and Catalog Federation.
Query Federation allows you to run SQL queries against external databases without migrating the data into Databricks. When a query is executed, Databricks “pushes down” parts of the query to the foreign database using JDBC. This means the work is distributed between Databricks compute and the remote database’s compute engine.
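Once a foreign catalog is configured (setup is covered below), querying it looks just like querying a local table. In this illustrative sketch, `pg_sales` is a placeholder name for a foreign catalog backed by a PostgreSQL database; a filter like the `WHERE` clause here is a candidate for pushdown to the remote engine:

```sql
-- `pg_sales` is a hypothetical foreign catalog; the three-level
-- catalog.schema.table naming is standard Unity Catalog addressing.
SELECT customer_id, SUM(amount) AS total_spend
FROM pg_sales.public.orders
WHERE order_date >= '2024-01-01'   -- eligible for pushdown to the remote DB
GROUP BY customer_id;
```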
Query Federation is the “fast track” to data insights, particularly in scenarios where full data ingestion isn’t practical.
Note that Databricks offers two complementary ways to work with external data: Query Federation for live, in-place access, and Lakeflow Connect for ingesting the data into the Lakehouse (more on that trade-off below).
Query Federation supports a massive variety of popular data sources, including:
Setting up federation is designed to be straightforward within the Unity Catalog framework:
If you use the API to create a connection to the data source, you will need to create the foreign catalog separately.
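The two setup steps can be sketched in SQL. This is an illustrative example assuming a MySQL source; the host, secret scope, and catalog names are placeholders you would replace with your own:

```sql
-- Step 1: create a connection object in Unity Catalog.
-- Credentials are pulled from a secret scope rather than hard-coded.
CREATE CONNECTION mysql_conn TYPE mysql
OPTIONS (
  host 'mysql.example.com',
  port '3306',
  user secret('my_scope', 'mysql_user'),
  password secret('my_scope', 'mysql_password')
);

-- Step 2: mirror a remote database as a foreign catalog,
-- making its tables queryable through Unity Catalog.
CREATE FOREIGN CATALOG mysql_catalog
USING CONNECTION mysql_conn
OPTIONS (database 'sales_db');
```

After this, `mysql_catalog.sales_db.<table>` is addressable from any Unity Catalog-enabled compute, with permissions governed centrally in Unity Catalog.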
Unlike standard query federation that pushes queries to a remote database engine, Catalog Federation allows Unity Catalog to directly access foreign tables in object storage (S3, ADLS, GCS). By bypassing the “middle-man” compute of a remote database, your queries run entirely on Databricks compute, making them significantly faster and more cost-effective.
One standout feature of Hive metastore federation is Authorized Paths. To prevent users from maliciously redirecting tables to sensitive data locations through an unsecured Hive metastore, admins define specific “authorized” cloud storage sub-paths. This ensures that even if a metadata entry is changed, Unity Catalog won’t allow access unless the path falls under a pre-approved prefix.
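A catalog federation setup for a legacy Hive metastore might look like the sketch below. The connection type and the `authorized_paths` option follow the documented pattern, but treat the exact option names as an assumption to verify against your workspace; the bucket path is a placeholder:

```sql
-- Sketch: federate the workspace's internal (legacy) Hive metastore.
CREATE CONNECTION hms_conn TYPE hive_metastore
OPTIONS (builtin true);

-- Only tables whose storage location falls under the approved prefix
-- will be accessible, even if HMS metadata is tampered with.
CREATE FOREIGN CATALOG hms_federated
USING CONNECTION hms_conn
OPTIONS (authorized_paths 's3://my-bucket/approved/');
```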
Understanding the difference between these two types of federation is key to choosing the right tool for your use case.
| Feature | Query Federation | Catalog Federation |
| --- | --- | --- |
| Connection Method | JDBC (standard database connection) | Direct object storage (S3, GCS, ADLS) |
| Source Type | Databases & warehouses | Lake metadata systems |
| Query Type | Pushed down to the foreign database via JDBC | Direct access to foreign tables via object storage |
| Compute Engine | Both Databricks and the remote DB engine | Databricks compute only |
| Performance | Depends on the remote source’s ability to handle the pushed-down query | Optimized and cost-effective thanks to direct storage access |
| Cost | Can be higher (remote compute costs) | Lower (uses native Databricks compute) |
| Data Format | Any (MySQL, SQL Server, Redshift, etc.) | Open formats (Parquet, Delta, Iceberg) |
| Best For | Ad-hoc reporting, proofs of concept (PoCs), live access to operational databases | Phased migrations to Unity Catalog, or maintaining a long-term hybrid model with external catalogs |
Lakehouse Federation is not just about convenience; it’s a strategic tool for accelerating time-to-insight. By removing the requirement for immediate ETL, data teams can provide business users with access to new data sources in minutes rather than weeks.
For high-volume, low-latency requirements, Databricks recommends Lakeflow Connect as an alternative to federation if the performance of live JDBC connections becomes a bottleneck.
In short, federation is excellent for:
But for:
you should ingest the data into Delta Lake and optimize it there.
Think of federation as a bridge, not always the final destination.
For a deeper technical dive, check out the official Databricks Lakehouse Federation Docs.