Lakehouse Federation supports both reading from and writing to tables in the internal Hive metastore (HMS) within Databricks workspaces, while offering read-only access to tables in an external HMS or AWS Glue. Asana leverages UC federation heavily across many of its external Hive metastore connections.
Lakehouse Federation supports AWS Glue as one of its federation targets (link). To federate into AWS Glue, a user creates a connection in Unity Catalog and provides a service credential that is used to connect to AWS Glue. For AWS Glue, the service credential is the ARN of an IAM role with the requisite permissions (link).
The IAM role has a custom trust policy that establishes a cross-account trust relationship (link), so that the Unity Catalog service can assume the IAM role and access AWS Glue on behalf of Databricks users. Additionally, the IAM role has an attached IAM policy that grants it permission to invoke the Glue actions specified in that policy.
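As a rough illustration of this role setup, the sketch below creates such an IAM role with boto3. The role name, the Databricks principal, the external ID, and the exact list of Glue actions are placeholders and assumptions; take the real values from the Databricks documentation for your account.

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Unity Catalog service (a Databricks-owned AWS
# principal) to assume this role. The principal and external ID below are
# placeholders; substitute the values documented for your Databricks account.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::<DATABRICKS_AWS_ACCOUNT_ID>:root"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "<DATABRICKS_ACCOUNT_ID>"}},
        }
    ],
}

# Permissions policy granting read access to the Glue Data Catalog.
# The action list is illustrative; scope it to what federation actually needs.
glue_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
            ],
            "Resource": "arn:aws:glue:us-west-2:<CATALOG_AWS_ACCOUNT_ID>:*",
        }
    ],
}

role = iam.create_role(
    RoleName="uc-glue-federation-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.put_role_policy(
    RoleName="uc-glue-federation-role",
    PolicyName="uc-glue-read-access",
    PolicyDocument=json.dumps(glue_policy),
)
print(role["Role"]["Arn"])  # this ARN is supplied as the service credential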
After the connection is created, the next step is to create a foreign catalog that uses this connection. Once the foreign catalog is created, Databricks Runtime uses the service credential associated with the catalog's connection to connect to AWS Glue and fetch the schema and table metadata that populates Unity Catalog. The newly created schemas and tables are owned by the catalog owner by default, and the catalog owner can delegate access to these entities to other users as business requirements dictate.
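A minimal sketch of this step from a notebook is shown below, assuming a connection named glue_conn has already been created with the service credential from the previous step; the catalog name glue_federated and the analysts group are hypothetical, and additional OPTIONS may be required depending on the connection type, so check the Lakehouse Federation documentation for the exact syntax.

```python
# Hypothetical object names for illustration; some connection types require
# extra OPTIONS on the foreign catalog (see the Lakehouse Federation docs).
spark.sql("""
  CREATE FOREIGN CATALOG IF NOT EXISTS glue_federated
  USING CONNECTION glue_conn
""")

# The catalog owner can then delegate access to other principals.
spark.sql("""
  GRANT USE CATALOG, USE SCHEMA, SELECT ON CATALOG glue_federated TO `analysts`
""")
```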
When a query is submitted against one of these entities in the foreign catalog, Databricks Runtime refreshes the metadata by refetching it from AWS Glue and updating Unity Catalog if needed, so the metadata is always current as of query time. The high-level flow of how AWS Glue credentials are generated to access the AWS Glue catalog can be seen below.
In the diagram above, the Unity Catalog service assumes the customer's IAM role through the cross-account trust relationship, obtains temporary AWS credentials, and uses them to call AWS Glue on behalf of the querying user.
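From the user's perspective, this credential exchange and metadata refresh happen transparently; a plain query against a federated table is enough to trigger it. The catalog, schema, and table names below are hypothetical.

```python
# Hypothetical names; the foreign catalog mirrors the Glue databases and tables.
df = spark.sql("SELECT * FROM glue_federated.sales.orders LIMIT 10")
df.show()
```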
When deploying Databricks on AWS with a federated catalog, networking between the Databricks data plane and the customer's AWS Glue Data Catalog is essential. This typically involves establishing private connectivity from the Databricks workspace's VPC to the Glue service in the customer's account using AWS PrivateLink, that is, an interface VPC endpoint.
An interface VPC endpoint within the customer's VPC acts as an entry point for traffic to the AWS Glue Data Catalog. It's associated with a security group controlling catalog access. The Databricks workspace is then configured to use this endpoint. Security groups and network ACLs must allow traffic on the required ports, typically 443. DNS resolution for the AWS Glue Data Catalog via the endpoint may need private DNS zones or a DNS forwarder. Ensuring high availability and monitoring network traffic are crucial for a resilient setup.
The figure above illustrates an example traffic flow from the customer workspace to the AWS Glue Data Catalog through AWS PrivateLink.
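For reference, the sketch below creates such an interface endpoint with boto3. The VPC, subnet, and security group IDs are placeholders, and us-west-2 is assumed as the region; the security group should allow inbound HTTPS (443) from the subnets that run Databricks compute.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Interface VPC endpoint for the Glue Data Catalog in the customer's VPC.
# IDs below are placeholders for your own VPC, subnets, and security group.
response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-west-2.glue",
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    PrivateDnsEnabled=True,  # resolve glue.us-west-2.amazonaws.com to the endpoint
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```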
Traffic to the AWS Glue Data Catalog can also traverse the public internet, though private connectivity is generally preferred for security.
The figure above illustrates an example traffic flow from the customer workspace to the AWS Glue Data Catalog through a NAT gateway.
If you are using serverless compute, traffic to the Glue catalog is automatically routed to the public Glue endpoint, glue.us-west-2.amazonaws.com. As long as the service credential includes the right IAM permissions, it works out of the box.
The figure above illustrates an example traffic flow from a serverless cluster to the public AWS Glue Data Catalog endpoint.
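One way to sanity-check that the service credential's IAM role really carries the required Glue permissions is to assume it directly and list the catalog's databases against that public endpoint. This is only a sketch: the role ARN is a placeholder, us-west-2 is assumed, and assuming the role from your own principal requires that the role's trust policy also trusts that principal (for example, temporarily, for testing).

```python
import boto3

# Assume the service credential's IAM role (placeholder ARN). This only works
# if the role's trust policy also trusts the principal running this check.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn="arn:aws:iam::<CATALOG_AWS_ACCOUNT_ID>:role/uc-glue-federation-role",
    RoleSessionName="uc-glue-federation-check",
)["Credentials"]

# Call the public Glue endpoint, which is what serverless compute routes to.
glue = boto3.client(
    "glue",
    region_name="us-west-2",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
print([db["Name"] for db in glue.get_databases()["DatabaseList"]])
```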
Customers using serverless compute with Glue federation require no additional setup; it works out of the box. For customers using classic compute with Glue federation, we strongly recommend private networking for Databricks on AWS with a federated catalog, for enhanced security and performance. Get in touch with your account team for more insights into IAM and networking best practices.