Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
AnthonyAnand

Focusing on the Core

In the modern data landscape, data is often scattered across various specialized databases, cloud warehouses, and legacy systems. Traditionally, unifying this data required complex Extract, Transform, Load (ETL) pipelines to move everything into a single central repository. Lakehouse Federation in Databricks changes this paradigm, allowing you to query external data sources directly from your Databricks workspace without moving the data first. 

Understanding Lakehouse Federation 

Lakehouse Federation is a query federation platform integrated within Databricks Unity Catalog. It enables users to run queries against multiple external data sources as if they were local tables in the Databricks Lakehouse. This approach minimizes data movement, maintains live access to operational systems, and simplifies data governance. 

There are two primary ways Databricks handles this: Query Federation and Catalog Federation. 

What is Query Federation? 

Query Federation allows you to run SQL queries against external databases without migrating the data into Databricks. When a query is executed, Databricks “pushes down” parts of the query to the foreign database using JDBC. This means the work is distributed between Databricks compute and the remote database’s compute engine. 
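As a sketch of what push-down looks like in practice (the catalog, schema, and table names here are hypothetical placeholders), a filter in a federated query can be evaluated by the remote engine rather than in Databricks:

```sql
-- Hypothetical foreign catalog backed by a MySQL connection.
-- The WHERE clause can be pushed down over JDBC, so MySQL
-- filters the rows before they are sent to Databricks.
SELECT order_id, order_total
FROM mysql_catalog.sales_db.orders
WHERE order_date >= '2024-01-01';

-- EXPLAIN FORMATTED shows which operators were pushed down
-- to the remote engine and which ran on Databricks.
EXPLAIN FORMATTED
SELECT order_id, order_total
FROM mysql_catalog.sales_db.orders
WHERE order_date >= '2024-01-01';
```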

(Diagram from Databricks docs)

The Case for Query Federation: Why It Matters 

Query Federation is the “fast track” to data insights, particularly in scenarios where full data ingestion isn’t practical: 

  • On-Demand Reporting: Access live operational data for real-time business reports. 
  • Rapid Prototyping: Perform Proof-of-Concept (PoC) work or exploratory analysis before committing to a permanent ETL pipeline. 
  • Incremental Migration: Maintain business continuity by querying legacy systems while slowly moving data to the Lakehouse. 
  • In-Place Governance: Apply Unity Catalog’s fine-grained access controls, lineage, and discovery tools to data that stays in its original source. 

Choosing Your Path: Query Federation vs. Lakeflow Connect 

Databricks offers two ways to handle external data: Query Federation and Lakeflow Connect. 

  • Choose Query Federation for ad-hoc, exploratory, or low-volume reporting where you want to avoid data movement. 
  • Choose Lakeflow Connect for high-volume production workloads that require low latency and scalability through managed ingestion. 

Ecosystem Compatibility and Supported Connectivity 

Query Federation supports a massive variety of popular data sources, including: 

  • Cloud Warehouses: Amazon Redshift, Google BigQuery, Snowflake, and Azure Synapse. 
  • Relational Databases: MySQL, PostgreSQL, Microsoft SQL Server, and Oracle. 
  • Specialized Sources: Teradata and Salesforce Data 360. 

How it Works: Connecting MySQL in 5 Minutes 

Setting up federation is designed to be straightforward within the Unity Catalog framework: 

  1. Create a Connection: Define a connection in Unity Catalog using your credentials (e.g., username/password) and the JDBC URL for your source. 
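In SQL, step 1 looks roughly like the following; the host, port, and secret scope/key names are placeholders, and the full option list is in the Databricks docs:

```sql
-- Create a Unity Catalog connection to an external MySQL server.
-- Host, port, and secret scope/key names are placeholders.
CREATE CONNECTION IF NOT EXISTS mysql_connection
TYPE mysql
OPTIONS (
  host 'mysql-host.example.com',
  port '3306',
  user secret('my_scope', 'mysql_user'),
  password secret('my_scope', 'mysql_password')
);
```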


  2. Create a Foreign Catalog: Use the connection to create a “foreign catalog,” which mirrors the structure of the external database inside Unity Catalog and can be queried immediately like any standard catalog. 


If you use the API to create the connection to the data source, you will need to create the foreign catalog separately. 

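A minimal SQL sketch of step 2 (catalog and connection names are the placeholders used above; note that for MySQL the foreign catalog mirrors the whole server, while sources such as PostgreSQL also take a database option):

```sql
-- Create a foreign catalog on top of the connection.
-- For MySQL, the catalog mirrors the entire server; each MySQL
-- database then appears as a schema inside the foreign catalog.
CREATE FOREIGN CATALOG IF NOT EXISTS mysql_catalog
USING CONNECTION mysql_connection;
```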

  3. Grant Permissions: Manage access at a granular level. You can grant privileges to users or groups on specific tables or schemas within the foreign catalog, just like any other Unity Catalog object. 

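For example, read access can be granted at the catalog or schema level (the group name `data_analysts` is a placeholder):

```sql
-- Grant read access on the foreign catalog, or narrow the grant
-- to a single schema or table, exactly as with any UC object.
GRANT USE CATALOG ON CATALOG mysql_catalog TO `data_analysts`;
GRANT USE SCHEMA, SELECT ON SCHEMA mysql_catalog.sales_db TO `data_analysts`;
```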

  4. Metadata: You can attach system or governed tags as key-value pairs. 

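Tags can be attached with standard Unity Catalog DDL; the tag keys and values below are purely illustrative:

```sql
-- Attach key-value tags to the foreign catalog
-- (the same syntax works on schemas, tables, and columns).
ALTER CATALOG mysql_catalog
SET TAGS ('environment' = 'prod', 'owner' = 'data-platform');
```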

  5. Query Away: Run standard SQL queries. For Query Federation, Unity Catalog automatically handles the translation and push-down to the external engine. 
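At this point a federated table is queried like any local one (the names reuse the placeholders from the earlier steps):

```sql
-- Standard SQL; Unity Catalog translates and pushes work
-- down to the MySQL engine where possible.
SELECT *
FROM mysql_catalog.sales_db.orders
LIMIT 10;
```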

(Reference image from Databricks)

 

(Reference image from MySQL)

Know the Limits: Key Technical Constraints 

  • Read-Only: Federated queries are read-only; you cannot write back to the external database through this interface. 
  • Compute Overhead: Performance depends on the remote system’s ability to handle the pushed-down query. 
  • Memory Considerations: If a subquery returns an extremely large result set to a single Databricks task, it can lead to memory pressure on the executor. 

What is Catalog Federation? 

Unlike standard query federation that pushes queries to a remote database engine, Catalog Federation allows Unity Catalog to directly access foreign tables in object storage (S3, ADLS, GCS). By bypassing the “middle-man” compute of a remote database, your queries run entirely on Databricks compute, making them significantly faster and more cost-effective. 

(Diagram from Databricks docs)

Why Catalog Federation Wins on Performance 

  • Performance Optimized: Since data is read directly from object storage, you eliminate the overhead and latency of JDBC translations. 
  • Centralized Governance: You get the full suite of Unity Catalog features — centralized access control, data lineage, auditing, and global search — for data that technically lives outside your main metastore. 
  • Hybrid Flexibility: It supports a long-term “hybrid” model. You don’t have to move all your data at once; you can manage external catalogs alongside Unity Catalog. 

Real-World Applications and Use Cases 

  1. Phased Migrations: Move to Unity Catalog incrementally without needing to rewrite your existing ETL code. 
  2. Legacy Integration: Connect to legacy Hive metastores or AWS Glue while keeping the governance layer modern. 
  3. Cross-Platform Unity: Seamlessly federate data from Snowflake or Salesforce Data 360 using optimized storage paths. 

Where Can You Connect? Supported Data Sources 

  • External Hive metastores  
  • Legacy Databricks Hive metastores 
  • AWS Glue 
  • Salesforce Data 360 
  • Snowflake 

The “Authorized Paths” Security Layer 

One standout feature of Hive metastore federation is Authorized Paths. To prevent users from maliciously redirecting tables to sensitive data locations in an unsecured Hive metastore, admins define specific “authorized” cloud storage sub-paths. This ensures that even if a metadata entry is changed, Unity Catalog won’t allow access unless the path is pre-approved. 
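As a hedged sketch of how this can look for Hive metastore federation — the `authorized_paths` option name, the connection name, and the bucket path are assumptions here, so verify against the current Databricks docs:

```sql
-- Restrict the federated HMS catalog to pre-approved storage paths.
-- Option name and paths below are assumptions for illustration.
CREATE FOREIGN CATALOG IF NOT EXISTS hms_catalog
USING CONNECTION hms_connection
OPTIONS (authorized_paths 's3://analytics-bucket/warehouse/');
```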

 

Query Federation vs. Catalog Federation 

Understanding the difference between these two types of federation is key to choosing the right tool for your use case. 

| Feature | Query Federation | Catalog Federation |
| --- | --- | --- |
| Connection Method | JDBC (standard database connection) | Direct object storage (S3, GCS, ADLS) |
| Source Type | Databases & warehouses | Lake metadata systems |
| Query Type | Pushed down to the foreign database via JDBC | Direct access to foreign tables via object storage |
| Compute Engine | Runs on both Databricks and the remote DB engine | Runs on Databricks compute only |
| Performance | Depends on the remote source’s ability to handle the pushed-down query | Performance-optimized thanks to direct storage access |
| Cost | Can be higher (remote compute costs) | Lower (uses native Databricks compute) |
| Data Format | Any (MySQL, SQL Server, Redshift, etc.) | Open formats (Parquet, Delta, Iceberg) |
| Best For | Ad-hoc reporting, Proofs of Concept (PoCs), live access to operational databases | Phased migrations to Unity Catalog, or a long-term hybrid model with external catalogs |

 

Strategic Impact: Unifying the Modern Data Stack 

Lakehouse Federation is not just about convenience; it’s a strategic tool for accelerating time-to-insight. By removing the requirement for immediate ETL, data teams can provide business users with access to new data sources in minutes rather than weeks. 

For high-volume, low-latency requirements, Databricks recommends Lakeflow Connect as an alternative to federation if the performance of live JDBC connections becomes a bottleneck. 

Final Thoughts

In short, federation is excellent for: 

  • Exploration 
  • Migration 
  • Operational access 
  • Short-term analytics 

But for: 

  • Large analytical workloads 
  • Frequent BI dashboards 
  • Heavy transformations 

We should ingest data into Delta Lake and optimize it. 
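When a federated source graduates to a production workload, the usual pattern is to materialize it into Delta; a minimal sketch, reusing the placeholder names from earlier:

```sql
-- One-off materialization of a federated table into a managed Delta table.
CREATE OR REPLACE TABLE main.analytics.orders AS
SELECT * FROM mysql_catalog.sales_db.orders;

-- Optionally optimize the ingested table for downstream BI.
OPTIMIZE main.analytics.orders;
```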

Think of federation as a bridge, not always the final destination. 

For a deeper technical dive, check out the official Databricks Lakehouse Federation Docs.