Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
agent8
Databricks Employee

Intro to federated catalog, and how IAM works under the hood

Lakehouse Federation supports both reading from and writing to tables in an internal Hive metastore (HMS) within Databricks workspaces, and offers read-only access to tables in external HMS and AWS Glue. Asana relies heavily on UC federation for many of its external Hive metastore connections.

Lakehouse federation supports AWS Glue as one of the federation targets (link). To federate into AWS Glue, a user needs to create a connection in Unity Catalog and provide a service credential, which is used to connect to AWS Glue. In the case of AWS Glue, the service credential is the ARN of an IAM role with the requisite permissions (link). 

The IAM role carries a custom trust policy that establishes a cross-account trust relationship (link), so that the Unity Catalog service can assume the role and access AWS Glue on behalf of Databricks users. The role also has an attached IAM permissions policy that allows it to invoke only the Glue actions listed in that policy.
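To make the two policies concrete, here is a minimal sketch that builds them as Python dictionaries. The principal ARN, external ID, account ID, and the list of Glue actions are placeholders and assumptions for illustration, not values from this article. Take the exact principal, external ID, and required actions from the linked Databricks documentation.

```python
import json
from typing import Dict, List, Optional

# Illustrative read-only Glue actions (assumption; see the linked docs for the required set).
DEFAULT_GLUE_ACTIONS = [
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetPartition",
    "glue:GetPartitions",
]


def build_trust_policy(trusted_principal_arn: str, external_id: str) -> Dict:
    """Trust policy: who may assume the role (the cross-account relationship)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": trusted_principal_arn},
                "Action": "sts:AssumeRole",
                # The external ID guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


def build_glue_read_policy(
    region: str, account_id: str, actions: Optional[List[str]] = None
) -> Dict:
    """Permissions policy: what the role may do once it has been assumed."""
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(actions or DEFAULT_GLUE_ACTIONS),
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/*",
                    f"{prefix}:table/*/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values only.
    trust = build_trust_policy("<unity-catalog-principal-arn>", "<external-id>")
    perms = build_glue_read_policy("us-west-2", "123456789012")
    print(json.dumps(trust, indent=2))
    print(json.dumps(perms, indent=2))
```

Keeping the trust policy (who can assume the role) separate from the permissions policy (what the role can do) mirrors how IAM evaluates them. In practice the resource list can be narrowed to specific databases and tables.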

 

After the connection is created, the next step is to create a foreign catalog that uses it. Once the foreign catalog exists, Databricks Runtime uses the service credential from the catalog's connection to connect to AWS Glue and fetch schema and table metadata, which it uses to populate Unity Catalog. The newly created schemas and tables are owned by the catalog owner by default, and the catalog owner can grant other users access to them as business requirements dictate.
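As an illustration of that delegation step, the sketch below generates the Unity Catalog `GRANT` statements a catalog owner might run to give a group read-only access to one federated table. The catalog, schema, table, and group names are hypothetical, and the helper functions are not part of any Databricks library.

```python
from typing import List


def quote_ident(name: str) -> str:
    """Backtick-quote an identifier, escaping embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def read_only_grants(catalog: str, schema: str, table: str, principal: str) -> List[str]:
    """Return the GRANT statements needed to read a single table.

    Reading a table requires USE CATALOG and USE SCHEMA on its parents
    in addition to SELECT on the table itself.
    """
    c, s, t, p = (quote_ident(x) for x in (catalog, schema, table, principal))
    return [
        f"GRANT USE CATALOG ON CATALOG {c} TO {p}",
        f"GRANT USE SCHEMA ON SCHEMA {c}.{s} TO {p}",
        f"GRANT SELECT ON TABLE {c}.{s}.{t} TO {p}",
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for statement in read_only_grants("glue_cat", "sales", "orders", "analysts"):
        print(statement)
```

In a notebook, each generated statement could be passed to `spark.sql(...)` by a user with sufficient privileges on the foreign catalog.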

 

When a query is submitted to the Databricks runtime against one of these entities in the foreign catalog, the runtime refetches the metadata from AWS Glue and updates Unity Catalog if anything has changed. This keeps the metadata current as of query time. The high-level flow for generating the credentials used to access the AWS Glue catalog is shown below.

[Figure: Networking and IAM Configuration for AWS Glue with Lakehouse Federation]

In the diagram above:

  1. The user submits a query on a foreign Glue catalog.
  2. During query analysis, the cluster requests the service credentials from Unity Catalog. Unity Catalog assumes the IAM role referenced by the catalog's service credential and generates temporary session credentials, which are returned to the cluster.
  3. The Databricks cluster creates an AWS Glue client using the temporary session credentials and fetches the required metadata for tables and schemas.
  4. The Databricks cluster then requests temporary credentials for the object storage path of the table location. Unity Catalog generates these temporary session credentials based on the IAM role associated with the storage location.
  5. During query execution, the Databricks cluster reads the underlying table data from object storage using the temporary session credentials.
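Steps 2 and 3 can be approximated outside Databricks with a short sketch. It is not how Unity Catalog is implemented. It only illustrates the pattern of exchanging a role for temporary session credentials and then using them to read Glue metadata. The function name, session name, and parameters are assumptions. The client calls follow the boto3 STS `assume_role` and Glue `get_table` interfaces, and the clients are passed in so the function can be exercised without AWS access.

```python
from typing import Any, Callable


def fetch_table_location(
    sts_client: Any,
    make_glue_client: Callable[..., Any],
    role_arn: str,
    external_id: str,
    database: str,
    table: str,
) -> str:
    """Assume a role, build a Glue client from the session, return the table location."""
    # Step 2 (approximation): exchange the IAM role for temporary session credentials.
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName="glue-federation-sketch",  # hypothetical session name
        ExternalId=external_id,
    )
    creds = response["Credentials"]

    # Step 3: create a Glue client that uses only the temporary credentials.
    glue = make_glue_client(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

    # Fetch table metadata; the storage location is what step 4 needs next.
    table_meta = glue.get_table(DatabaseName=database, Name=table)["Table"]
    return table_meta["StorageDescriptor"]["Location"]
```

With boto3 installed, `sts_client` could be `boto3.client("sts")` and `make_glue_client` could be `functools.partial(boto3.client, "glue", region_name="us-west-2")`. The returned location is the object storage path for which step 4 requests a second, separate set of temporary credentials.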

 

Networking under the hood for catalog federation to AWS Glue Catalog

Data Plane - Federating to AWS Glue Catalog through Private Link

When deploying Databricks on AWS with a federated catalog, network connectivity between the Databricks data plane and the customer's AWS Glue Data Catalog is essential. This typically means establishing private connectivity from the Databricks workspace's VPC to AWS Glue using AWS PrivateLink, in the form of an interface VPC endpoint.

An interface VPC endpoint within the customer's VPC acts as an entry point for traffic to the AWS Glue Data Catalog. It's associated with a security group controlling catalog access. The Databricks workspace is then configured to use this endpoint. Security groups and network ACLs must allow traffic on the required ports, typically 443. DNS resolution for the AWS Glue Data Catalog via the endpoint may need private DNS zones or a DNS forwarder. Ensuring high availability and monitoring network traffic are crucial for a resilient setup.
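One way to sanity-check this setup from a cluster is to resolve the regional Glue hostname and see whether it maps to private addresses, which is what one would expect when an interface endpoint with private DNS is in effect, and then test reachability on port 443. The sketch below is a rough diagnostic under those assumptions. The helper names are hypothetical and it uses only the Python standard library.

```python
import ipaddress
import socket
from typing import Iterable, List


def glue_endpoint(region: str) -> str:
    """Regional AWS Glue hostname, for example glue.us-west-2.amazonaws.com."""
    return f"glue.{region}.amazonaws.com"


def resolve_ipv4(host: str, port: int = 443) -> List[str]:
    """Resolve a hostname to its distinct IPv4 addresses."""
    infos = socket.getaddrinfo(host, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def all_private(addresses: Iterable[str]) -> bool:
    """True when at least one address is given and every address is private."""
    addrs = list(addresses)
    return bool(addrs) and all(ipaddress.ip_address(a).is_private for a in addrs)


def can_connect(host: str, port: int = 443, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    host = glue_endpoint("us-west-2")
    addresses = resolve_ipv4(host)
    print(host, addresses)
    print("resolves to private IPs:", all_private(addresses))
    print("reachable on 443:", can_connect(host))
```

Public addresses here suggest that traffic is leaving through a NAT gateway rather than the interface endpoint. A failed connection points at security group, network ACL, or routing rules.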

 

[Figure: example traffic flow from the customer workspace to the AWS Glue Catalog through PrivateLink]

 

Data Plane - Federating to AWS Glue Catalog through NAT

 

Traffic to the AWS Glue Data Catalog can also traverse the public internet, though private connectivity is generally preferred for security.

 

[Figure: example traffic flow from the customer workspace to the AWS Glue Catalog through a NAT gateway]

 

Control Plane - Federating to AWS Glue Catalog through NAT

 

With serverless compute, traffic to the Glue catalog is automatically routed to the public Glue endpoint for the region (for example, glue.us-west-2.amazonaws.com). As long as the service credential includes the right IAM permissions, federation works out of the box.

[Figure: example traffic flow from a serverless cluster to the public AWS Glue Catalog endpoint]

Customers using serverless compute with Glue federation need no additional networking setup. For customers using classic compute with Glue federation, we strongly recommend private networking for enhanced security and performance. Get in touch with your account team for more guidance on IAM and networking best practices.