Summary
- Unity Catalog is the most powerful, open, and interoperable catalog for Data and AI.
- This blog post provides a step-by-step guide on configuring and accessing Unity Catalog tables from EMR Spark and EMR Trino clusters using Iceberg REST Catalog's open API.
- This allows organizations to reduce data duplication by centrally managing and securing a single copy of their data assets in Databricks Unity Catalog while enabling governed access to Unity Catalog tables from AWS EMR.
Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, providing open connectivity to any data source, common format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.
With open APIs and credential vending, Unity Catalog enables external engines like Trino, DuckDB, Apache Spark™, Daft, and other Iceberg REST catalog-integrated engines like Dremio to access its governed data. This interoperability is particularly valuable for teams leveraging AWS EMR for open source analytics workloads. Data teams often resort to copying data across platforms, which creates data silos that increase the risk of unauthorized access and complicate compliance. In this blog, we’ll show how organizations can break down these data silos by enabling direct access to Unity Catalog tables from AWS EMR. This minimizes data duplication, allowing organizations to use a single copy of data across different analytics and AI workloads with unified governance. Customers can use Databricks’ best-in-class ETL price/performance for the upstream data processing and access the published data by integrating Unity Catalog with EMR Spark and EMR Trino through the Iceberg REST Catalog.
In this blog post, we’ll explore the Iceberg REST Catalog's (IRC) value and provide a step-by-step guide on configuring and accessing Unity Catalog tables from EMR Spark and EMR Trino clusters.
Iceberg REST API Catalog Integration
Apache Iceberg™ maintains atomicity and consistency by creating new metadata files for each table change. The Iceberg catalog tracks the new metadata per write and ensures incomplete writes do not corrupt an existing metadata file. The Iceberg REST catalog API is a standardized, open API specification that provides a unified interface for Iceberg catalogs. It decouples catalog implementations from clients and solves interoperability across engines and catalogs.
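To make the idea of a standardized catalog interface concrete, here is a minimal Python sketch of how a client might build two of the routes defined by the Iceberg REST catalog specification: the configuration endpoint and the load-table endpoint. The host name, prefix, and table names in the usage note are placeholders we made up for illustration, and the helper names are our own; this only constructs URLs and does not call any catalog.

```python
from __future__ import annotations

from urllib.parse import quote, urlencode

# The REST spec joins the levels of a multi-level namespace with the
# ASCII unit separator (0x1F) before URL-encoding the result.
NAMESPACE_SEPARATOR = "\x1f"


def config_url(base_uri: str, warehouse: str) -> str:
    """Build the URL for the catalog configuration route (GET /v1/config)."""
    query = urlencode({"warehouse": warehouse})
    return f"{base_uri.rstrip('/')}/v1/config?{query}"


def load_table_url(base_uri: str, prefix: str, namespace: list[str], table: str) -> str:
    """Build the URL for the load-table route of a single table.

    `prefix` is whatever value the catalog returned from the config route;
    it is inserted as-is (hypothetical value in the usage note).
    """
    encoded_namespace = quote(NAMESPACE_SEPARATOR.join(namespace), safe="")
    encoded_table = quote(table, safe="")
    path = f"/v1/{prefix}/namespaces/{encoded_namespace}/tables/{encoded_table}"
    return base_uri.rstrip("/") + path
```

For example, `load_table_url("https://example.test/base", "pfx", ["sales"], "orders")` yields a URL ending in `/v1/pfx/namespaces/sales/tables/orders`. Because every engine builds the same routes, the catalog behind them can be swapped without changing the client.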
Unity Catalog (UC) implements the Iceberg REST catalog interface, enabling interoperability with any engine integrated with the Iceberg REST Catalog, such as Apache Spark™, Trino, Dremio, and Snowflake. Unity Catalog’s Iceberg REST Catalog endpoints allow external systems to access tables via open APIs while benefiting from performance enhancements like Liquid Clustering and Predictive Optimization. At the same time, Databricks workloads continue to benefit from advanced Unity Catalog features like Change Data Feed.
Securing Access via Credential Vending
Unity Catalog’s credential vending dynamically issues temporary credentials for secure access to cloud storage. When an external engine, such as Trino, requests data from an Iceberg table registered in a UC metastore, Unity Catalog uses IAM roles or managed identities to generate short-lived credentials scoped to the specific dataset being queried, and returns them together with the dataset's storage URLs. This eliminates manual credential management while maintaining security and compliance. The detailed steps are captured in the diagram below.
Figure 1. Steps in the data flow to securely access data assets with credential vending
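As a rough illustration of the client side of this flow, the sketch below shows the request header the Iceberg REST specification defines for asking a catalog to vend credentials, and a helper that pulls temporary S3 credentials out of a load-table response. The `s3.*` keys are the property names Iceberg's S3 file IO conventionally reads; whether a particular catalog returns exactly these keys is an assumption here, and the helper name and sample values are hypothetical.

```python
from __future__ import annotations

# Header a client adds to a load-table request to ask the catalog for
# temporary storage credentials instead of using its own.
ACCESS_DELEGATION_HEADER = {"X-Iceberg-Access-Delegation": "vended-credentials"}

# Property names under which temporary S3 credentials are conventionally
# passed to Iceberg clients (assumed to appear in the response "config").
S3_CREDENTIAL_KEYS = (
    "s3.access-key-id",
    "s3.secret-access-key",
    "s3.session-token",
)


def extract_vended_s3_credentials(load_table_response: dict) -> dict | None:
    """Return the temporary S3 credentials from a load-table response.

    Returns None when the response does not carry a complete credential set,
    in which case the engine would need its own storage credentials.
    """
    config = load_table_response.get("config", {})
    if not all(key in config for key in S3_CREDENTIAL_KEYS):
        return None
    return {key: config[key] for key in S3_CREDENTIAL_KEYS}
```

Engines such as Trino and Spark perform this step internally, so you normally never handle the credentials yourself; the sketch is only meant to show where they travel.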
Experiencing EMR Spark and Trino in Action with Unity Catalog’s Open APIs
In this section, we’ll look at accessing the Iceberg tables registered in Databricks Unity Catalog using EMR Spark and Trino. We’ll walk through the following steps:
- Setting up the Unity Catalog Iceberg tables from the Databricks workspace
- Setting up AWS EMR Spark and Trino
- Test EMR Spark and Trino connectivity to Databricks Unity Catalog
- EMR Trino SQL - Read and Write Managed Iceberg tables
- EMR Spark SQL - Read and Write Managed Iceberg tables
- Performing UC access control test
- Using OAuth for connecting to Unity Catalog IRC configuration
Step 1: Setting up the Unity Catalog Iceberg tables from the Databricks workspace
This blog assumes that a Unity Catalog-enabled workspace is already set up and that account principals are configured with proper authentication and authorization. To get started with Unity Catalog, follow the Unity Catalog Setup Guide.
Personal Access Tokens (PATs) are essential for authenticating API requests when integrating external tools or automating workflows in Databricks. To create a PAT, follow the Databricks PAT Setup Guide. Log in to your Databricks workspace, navigate to "User Settings," and generate a token with a specific lifespan and permissions. Save the token securely, as it cannot be retrieved later.
Databricks enables access to Unity Catalog tables through the Unity REST API and the Iceberg REST catalog, offering seamless integration with external systems. For more details, refer to Access Databricks data using external systems. To facilitate external data access, a metastore administrator can enable the capability for each metastore that requires external connectivity. Additionally, the user or service principal configuring the connection must possess the EXTERNAL USE SCHEMA privilege for every schema containing tables intended for external reads.
We will use the following Unity Catalog SQL commands from our Databricks workspace to create a catalog and schema, manage Iceberg and Delta tables with records, and grant permissions to the principal associated with the PAT token.
Note: We used the TPCH sample datasets available in the Databricks samples catalog for this example.
The Databricks Principal used in this example was given all the necessary UC permissions (such as Use Catalog, Use Schema, External Use Schema, and Create table) to perform the activities.
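As a hedged illustration of the permissions listed above, the following Python helper renders the corresponding GRANT statements for one principal. The catalog, schema, and principal names are placeholders to replace with your own, and the helper itself is hypothetical; it only produces SQL text that you would then run from a Databricks SQL editor or notebook.

```python
from __future__ import annotations


def render_grants(catalog: str, schema: str, principal: str) -> list[str]:
    """Render GRANT statements for the privileges named in this step."""
    # Principals such as e-mail addresses must be wrapped in backticks.
    target = f"`{principal}`"
    qualified_schema = f"{catalog}.{schema}"
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO {target};",
        f"GRANT USE SCHEMA ON SCHEMA {qualified_schema} TO {target};",
        f"GRANT EXTERNAL USE SCHEMA ON SCHEMA {qualified_schema} TO {target};",
        f"GRANT CREATE TABLE ON SCHEMA {qualified_schema} TO {target};",
    ]
```

Calling `render_grants("my_catalog", "my_schema", "user@example.com")` returns four statements, one per privilege, which keeps the set of grants explicit and easy to review.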
Step 2: Setting up AWS EMR Spark and Trino
You can create an EMR cluster by following the instructions in the EMR management guide. To access your Unity Catalog tables using EMR Trino, you additionally need to configure the Trino external catalog using bootstrap configurations. Below are the steps for bootstrapping the Trino catalog properties file.
- Create a Catalog Configuration properties file - databricks_uc_irc_catalog.properties
# etc/trino/conf/catalog/databricks_uc_irc_catalog.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://<databricks_workspace_url>/api/2.1/unity-catalog/iceberg
iceberg.rest-catalog.warehouse=databricks_uc_irc_catalog
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.token=<databricks_principal_pat_token>
iceberg.rest-catalog.vended-credentials-enabled=true
fs.native-s3.enabled=true
s3.region=us-east-2
- Upload the properties file to the S3 location accessible by the EMR IAM role.
- As shown below, create a shell script (e.g., emr_uc_iceberg_catalog_bootstrap.sh). This shell script copies the Trino catalog properties file from the S3 location to the EMR EC2 node file system path during bootstrapping.
set -ex
sudo aws s3 cp s3://emr-irc-testing-uc-bucket/emrconfigs/iceberg_rest_catalog_properties/databricks_demo.properties /etc/trino/conf/catalog/databricks_uc_irc_catalog.properties
Note: While setting up your EMR cluster, add the “Bootstrap actions” and then fill in the details as shown below.
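If you manage several workspaces or regions, the catalog properties file from the first step can be generated rather than edited by hand. The sketch below renders the same property keys used in this post from a few parameters; the function name and the example values are our own, and the output contains the token in plain text, so treat the resulting file as a secret.

```python
from __future__ import annotations


def render_trino_catalog_properties(
    workspace_url: str, token: str, warehouse: str, region: str
) -> str:
    """Render the Trino Iceberg REST catalog properties as file content."""
    properties = {
        "connector.name": "iceberg",
        "iceberg.catalog.type": "rest",
        "iceberg.rest-catalog.uri": f"https://{workspace_url}/api/2.1/unity-catalog/iceberg",
        "iceberg.rest-catalog.warehouse": warehouse,
        "iceberg.rest-catalog.security": "OAUTH2",
        "iceberg.rest-catalog.oauth2.token": token,
        "iceberg.rest-catalog.vended-credentials-enabled": "true",
        "fs.native-s3.enabled": "true",
        "s3.region": region,
    }
    # One key=value pair per line, with a trailing newline at the end.
    return "\n".join(f"{key}={value}" for key, value in properties.items()) + "\n"
```

Writing the returned string to a local file and uploading it to S3 gives the bootstrap script above something to copy onto each node.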
Step 3: Test EMR Spark and Trino connectivity to Databricks Unity Catalog
To test EMR connectivity over SSH, follow the steps below. For details on connecting to the EMR primary node using SSH, refer here.
- Allowlist your device's IP address for the SSH port (22) in the security group associated with the EMR primary node.
- Now, use the link “Connect to the Primary node using SSH” from the EMR cluster details page to get the command to establish a connection to the primary node. It looks like this:
ssh -i ~/<<Your PEM File>>.pem hadoop@<<EMR Primary Node DNS>>
- Once in the EMR primary node shell, run the command “trino-cli” to access Trino:
[hadoop@ip-10-4-7-67 ~]$ trino-cli
trino> SHOW CATALOGS;
Catalog
---------------------
databricks_uc_irc_catalog
hive
hudi
system
(4 rows)
- Alternatively, you can start a Spark SQL terminal with the following command.
spark-sql --name "uc-iceberg" \
  --master "local[*]" \
  --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,org.apache.hadoop:hadoop-common:3.4.1,org.apache.hadoop:hadoop-aws:3.4.1" \
  --conf "spark.hadoop.fs.s3.impl=org.apache.iceberg.aws.s3.S3FileIO" \
  --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
  --conf "spark.sql.catalog.databricks_uc_irc_catalog=org.apache.iceberg.spark.SparkCatalog" \
  --conf "spark.sql.catalog.databricks_uc_irc_catalog.catalog-impl=org.apache.iceberg.rest.RESTCatalog" \
  --conf "spark.sql.catalog.databricks_uc_irc_catalog.uri=https://<<Your Workspace URL>>/api/2.1/unity-catalog/iceberg-rest" \
  --conf "spark.sql.defaultCatalog=databricks_uc_irc_catalog" \
  --conf "spark.sql.catalog.databricks_uc_irc_catalog.warehouse=databricks_uc_irc_catalog" \
  --conf "spark.sql.catalog.databricks_uc_irc_catalog.token=<<Your Databricks PAT>>"
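Because most of the settings in the Spark SQL command repeat the catalog name, it can help to assemble the command programmatically. The following sketch builds the argument list for the catalog-related settings only (it leaves out the file system setting and any other tuning); the function names are our own, and the workspace URL, token, and package list are values you supply.

```python
from __future__ import annotations

import shlex


def build_spark_sql_command(
    catalog: str, workspace_url: str, token: str, packages: list[str]
) -> list[str]:
    """Build a spark-sql argument list for an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog}"
    conf = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": f"https://{workspace_url}/api/2.1/unity-catalog/iceberg-rest",
        f"{prefix}.warehouse": catalog,
        f"{prefix}.token": token,
        "spark.sql.defaultCatalog": catalog,
    }
    args = [
        "spark-sql",
        "--name", "uc-iceberg",
        "--master", "local[*]",
        "--packages", ",".join(packages),
    ]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def to_shell(args: list[str]) -> str:
    """Quote the argument list so it can be pasted into a shell."""
    return shlex.join(args)
```

Passing the list to `to_shell` produces a single quoted command line, which avoids quoting mistakes around values such as `local[*]`.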
Step 4: EMR Trino SQL - Read and Write Managed Iceberg tables
Let us perform some DML operations using EMR Trino. EMR Trino is configured (refer to the EMR bootstrapping section from Step 2 above) to read tables from the Databricks UC catalog. You can use the Trino terminal to perform SQL queries. We will use standard ANSI SQL queries to retrieve data, perform aggregations, or join tables.
It's important to note that the Trino Iceberg REST Catalog is configured only to recognize Iceberg tables. Consequently, Delta tables are absent from the catalog schema.