Performing DML Operations on Databricks Unity Cata... - Databricks Community

sreeramthoom · ‎07-02-2025

Summary

Unity Catalog is the most powerful, open, and interoperable catalog for Data and AI.
This blog post provides a step-by-step guide on configuring and accessing Unity Catalog tables from EMR Spark and EMR Trino clusters using Iceberg REST Catalog's open API.
This allows organizations to reduce data duplication by centrally managing and securing a single copy of their data assets in Databricks Unity Catalog while enabling governed access to Unity Catalog tables from AWS EMR.

Databricks Unity Catalog (UC) is the industry’s only unified and open governance solution for data and AI, built into the Databricks Data Intelligence Platform. Unity Catalog provides a single source of truth for your organization’s data and AI assets, providing open connectivity to any data source, common format, unified governance with detailed lineage tracking, comprehensive monitoring, and support for open sharing and collaboration.

With open APIs and credential vending, Unity Catalog enables external engines like Trino, DuckDB, Apache Spark™, Daft, and other Iceberg REST catalog-integrated engines like Dremio to access its governed data. This interoperability is particularly valuable for teams leveraging AWS EMR for open source analytics workloads. Data teams often resort to copying data across platforms, which creates data silos that increase the risk of unauthorized access, and complicates compliance. In this blog, we’ll show how organizations can break down these data silos by enabling direct access to Unity Catalog tables from AWS EMR. This minimizes data duplication, allowing organizations to use a single copy of data across different analytics and AI workloads with unified governance. Customers can use Databricks’ best-in-class ETL price/performance for the upstream data processing and access the published data by integrating Unity Catalog with EMR Spark and EMR Trino through the Iceberg REST Catalog.

In this blog post, we’ll explore the Iceberg REST Catalog's (IRC) value and provide a step-by-step guide on configuring and accessing Unity Catalog tables from EMR Spark and EMR Trino clusters.

Iceberg REST API Catalog Integration

Apache Iceberg™ maintains atomicity and consistency by creating new metadata files for each table change. The Iceberg catalog tracks the new metadata per write and ensures incomplete writes do not corrupt an existing metadata file. The Iceberg REST catalog API is a standardized, open API specification that provides a unified interface for Iceberg catalogs. It decouples catalog implementations from clients and solves interoperability across engines and catalogs.

Unity Catalog (UC) implements the Iceberg REST catalog interface, enabling interoperability with any engine integrated with the Iceberg REST Catalog, such as Apache Spark™, Trino, Dremio, and Snowflake. Unity Catalog’s Iceberg REST Catalog endpoints allow external systems to access tables via open APIs while benefiting from performance enhancements like Liquid Clustering and Predictive Optimization. At the same time, Databricks workloads continue to benefit from advanced Unity Catalog features like Change Data Feed.

Securing Access via Credential Vending

Unity Catalog’s credential vending dynamically issues temporary credentials for secure access to cloud storage. When an external engine, such as Trino, requests data from an Iceberg table registered in a UC metastore, Unity Catalog generates short-lived credentials using IAM roles or managed identities to the specific dataset being queried by the user and storage URLs of the dataset, thereby eliminating the manual credential management while maintaining security and compliance. The detailed steps are captured in the diagram below.

Figure 1. Steps in the data flow to securely access data assets with credential vending

Experiencing EMR Spark and Trino in Action with Unity Catalog’s Open APIs

In this section, we’ll look at accessing the Iceberg tables registered in Databricks Unity Catalog using EMR Spark and Trino. We’ll walk through the following steps:

Setting up the Unity Catalog Iceberg tables from the Databricks workspace
Setting up AWS EMR Spark and Trino
Test EMR Spark and Trino connectivity to Databricks Unity Catalog
EMR Trino SQL - Read and Write Managed Iceberg tables
EMR Spark SQL - Read and Write Managed Iceberg tables
Performing UC access control test
Using OAuth for connecting to Unity Catalog IRC configuration

Step 1: Setting up the Unity Catalog Iceberg tables from the Databricks workspace

The blog assumes the Unity Catalog enabled Workspace setup and Account principles are configured with proper authorization and authentication. To get started with Unity Catalog, follow the Unity Catalog Setup Guide.

Personal Access Tokens (PATs) are essential for authenticating API requests when integrating external tools or automating workflows in Databricks. To create a PAT, follow the Databricks PAT Setup Guide. Log in to your Databricks workspace, navigate to "User Settings," and generate a token with a specific lifespan and permissions. Save the token securely, as it cannot be retrieved later.

Databricks enables access to Unity Catalog tables through the Unity REST API and the Iceberg REST catalog, offering seamless integration with external systems. For more details, refer to Access Databricks data using external systems. To facilitate external data access, a metastore administrator can enable the capability for each metastore that requires external connectivity. Additionally, the user or service principal configuring the connection must possess the EXTERNAL USE SCHEMA privilege for every schema containing tables intended for external reads.

We will use the following Unity Catalog SQL commands from our Databricks workspace to create a catalog and schema, manage Iceberg and Delta tables with records, and grant permissions to the principal associated with the PAT token.

Note: We used the TPCH sample datasets available in the Databricks samples catalog for this example.

The Databricks Principal used in this example was given all the necessary UC permissions (such as Use Catalog, Use Schema, External Use Schema, and Create table) to perform the activities.

CREATE CATALOG databricks_uc_irc_catalog;
CREATE SCHEMA databricks_uc_irc_catalog.uc_irc_schema;

GRANT EXTERNAL USE SCHEMA ON SCHEMA databricks_uc_irc_catalog.uc_irc_schema TO `<<Your Databricks Principal>>`

USE CATALOG databricks_uc_irc_catalog;
USE SCHEMA uc_irc_schema;

-- Delta tables
create table nation_managed_delta deep clone samples.tpch.nation;
create table supplier_managed_delta deep clone samples.tpch.supplier;
create table region_managed_delta deep clone samples.tpch.region;

-- Iceberg tables

CREATE TABLE customers_managed_iceberg
USING ICEBERG 
SELECT * FROM samples.tpch.customer;

CREATE TABLE lineitem_managed_iceberg
USING ICEBERG 
SELECT * FROM samples.tpch.lineitem;

CREATE TABLE orders_managed_iceberg
USING ICEBERG 
SELECT * FROM samples.tpch.orders;


-- Iceberg External tables using UniForm

CREATE TABLE IF NOT EXISTS customers_external_iceberg
DEEP CLONE samples.tpch.customer
LOCATION "s3://<BUCKET_NAME>/external/iceberg/";


ALTER TABLE customers_external_iceberg SET TBLPROPERTIES (
 'delta.minReaderVersion' = '2',
 'delta.minWriterVersion' = '5',
 'delta.columnMapping.mode' = 'name',
 'delta.enableIcebergCompatV2' = 'true',
 'delta.universalFormat.enabledFormats' = 'iceberg'
);

Step 2: Setting up AWS EMR Spark and Trino

Following the EMR management guide instructions, you can create an EMR cluster. To access your Unity Catalog tables using EMR Trino, you additionally need to configure the Trino external catalog using the bootstrap configurations. Below, we have provided the steps for bootstrapping the trino catalog properties file.

Create a Catalog Configuration properties file - databricks_uc_irc_catalog.properties

# etc/trino/conf/catalog/databricks_uc_irc_catalog.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://<databricks_workspace_url>/api/2.1/unity-catalog/iceberg
iceberg.rest-catalog.warehouse=databricks_uc_irc_catalog
iceberg.rest-catalog.security=OAUTH2
iceberg.rest-catalog.oauth2.token=<databricks_principal_pat_token>
iceberg.rest-catalog.vended-credentials-enabled=true
fs.native-s3.enabled=true
s3.region=us-east-2

Upload the properties file to the S3 location accessible by the EMR IAM role.

As shown below, create a shell script (e.g., emr_uc_iceberg_catalog_bootstrap.sh). This shell script copies the Trino catalog properties file from the S3 location to the EMR EC2 node file system path during bootstrapping.

set -ex
sudo aws s3 cp s3://emr-irc-testing-uc-bucket/emrconfigs/iceberg_rest_catalog_properties/databricks_demo.properties /etc/trino/conf/catalog/databricks_uc_irc_catalog.properties

Note: While setting up your EMR cluster, add the “Bootstrap actions” and then fill in the details as shown below.

Step 3: Test EMR Spark and Trino connectivity to Databricks Unity Catalog

To perform EMR Connectivity testing using SSH, we will have to follow the following steps: For details on connecting to the EMR primary node using SSH, refer here.

Allowlist your device's IP address for SSH port (Number 22) in the security group associated with the EMR primary node.
Now, use the link “Connect to the Primary node using SSH” from the EMR cluster details page to get the command to establish a connection to the primary node. It looks like this
```
ssh -i ~/<<Your PEM File>>.pem hadoop@<<EMR Primary Node DNS>>
```
Once in the EMR master node shell,

To access Trino in the EMR shell, run the command “trino-cli”

[hadoop@ip-10-4-7-67 ~]$ trino-cli
trino> SHOW CATALOGS;
      Catalog      
---------------------
databricks_uc_irc_catalog
hive               
hudi               
system             
(4 rows)

Alternatively, you can initiate a Spark SQL terminal with the following command prompt.

spark-sql --name "uc-iceberg" \
   --master "local[*]" \
   --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,org.apache.hadoop:hadoop-common:3.4.1,org.apache.hadoop:hadoop-aws:3.4.1" \
   --conf "spark.hadoop.fs.s3.impl=org.apache.iceberg.aws.s3.S3FileIO" \
   --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
   --conf "spark.sql.catalog.databricks_uc_irc_catalog=org.apache.iceberg.spark.SparkCatalog" \
   --conf "spark.sql.catalog.databricks_uc_irc_catalog.catalog-impl=org.apache.iceberg.rest.RESTCatalog" \
   --conf "spark.sql.catalog.databricks_uc_irc_catalog.uri=https://<<Your Workspace URL>>/api/2.1/unity-catalog/iceberg-rest" \
   --conf "spark.sql.defaultCatalog=databricks_uc_irc_catalog" \
   --conf "spark.sql.catalog.databricks_uc_irc_catalog.warehouse=databricks_uc_irc_catalog" \
   --conf "spark.sql.catalog.databricks_uc_irc_catalog.token=<<Your Databricks PAT>>"

Step 4: EMR Trino SQL - Read and Write Managed Iceberg tables

Let us perform some DML operations using EMR Trino. EMR Trino is configured (refer to the EMR bootstrapping section from Step 2 above) to read tables from the Databricks UC catalog. You can use the Trino terminal to perform SQL queries. We will use standard ANSI SQL queries to retrieve data, perform aggregations, or join tables.

[hadoop@ip-10-4-7-67 ~]$ trino-cli
trino> SHOW CATALOGS;
      Catalog      
---------------------
databricks_uc_irc_catalog
hive               
hudi               
system             
(4 rows)

Query 20250328_202001_00000_eqfxa, FINISHED, 2 nodes
Splits: 8 total, 8 done (100.00%)
1.31 [0 rows, 0B] [0 rows/s, 0B/s]

trino> SHOW SCHEMAS IN databricks_uc_irc_catalog;
      Schema      
--------------------
default           
information_schema
uc_irc_schema   
(3 rows)

Query 20250328_202019_00001_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
1.51 [3 rows, 55B] [1 rows/s, 36B/s]

trino> USE databricks_uc_irc_catalog.uc_irc_schema;

It's important to note that the Trino Iceberg REST Catalog is configured only to recognize Iceberg tables. Consequently, Delta tables are absent from the catalog schema.

trino:uc_irc_schema> SHOW TABLES;
 Table  
----------
customers_managed_iceberg
lineitem_managed_iceberg
orders_managed_iceberg  
customers_external_iceberg
(4 rows)

Query 20250328_202517_00007_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
0.38 [3 rows, 97B] [7 rows/s, 256B/s]


trino:uc_irc_schema> SELECT
                   ->     l_returnflag,
                   ->     l_linestatus,
                   ->     SUM(l_quantity) AS sum_qty,
                   ->     SUM(l_extendedprice) AS sum_base_price,
                   ->     SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
                   ->     SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
                   ->     AVG(l_quantity) AS avg_qty,
                   ->     AVG(l_extendedprice) AS avg_price,
                   ->     AVG(l_discount) AS avg_disc,
                   ->     COUNT(*) AS count_order
                   -> FROM
                   ->     lineitem_managed_iceberg
                   -> WHERE
                   ->     l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
                   -> GROUP BY
                   ->     l_returnflag,
                   ->     l_linestatus
                   -> ORDER BY
                   ->     l_returnflag,
                   ->     l_linestatus;
l_returnflag | l_linestatus |   sum_qty    | sum_base_price  |  sum_disc_price   |     sum_charge      | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+--------------+-----------------+-------------------+---------------------+---------+-----------+----------+-------------
A            | F            | 188818373.00 | 283107483036.12 | 268952035589.0630 | 279714361804.228122 |   25.50 |  38237.67 |     0.05 |     7403889
N            | F            |   4913382.00 |   7364213967.95 |   6995782725.6633 |   7275821143.989585 |   25.53 |  38267.78 |     0.05 |      192439
N            | O            | 371626663.00 | 557251817321.64 | 529391298998.6177 | 550573649707.749900 |   25.50 |  38233.71 |     0.05 |    14574883
R            | F            | 188960009.00 | 283310887148.20 | 269147687267.2029 | 279912972474.864338 |   25.51 |  38252.41 |     0.05 |     7406353
(4 rows)

Query 20250328_202610_00008_eqfxa, FINISHED, 2 nodes
Splits: 25 total, 25 done (100.00%)
8.10 [30M rows, 253MiB] [3.7M rows/s, 31.2MiB/s]

trino:uc_irc_schema> SELECT
                   ->     l_orderkey,
                   ->     SUM(l_extendedprice * (1 - l_discount)) AS revenue,
                   ->     o_orderdate,
                   ->     o_shippriority
                   -> FROM
                   ->     customers_managed_iceberg c
                   ->     JOIN orders_managed_iceberg o ON c.c_custkey = o.o_custkey
                   ->     JOIN lineitem_managed_iceberg l ON l.l_orderkey = o.o_orderkey
                   -> WHERE
                   ->     c.c_mktsegment = 'BUILDING'
                   ->     AND o.o_orderdate < DATE '1995-03-15'
                   ->     AND l.l_shipdate > DATE '1995-03-15'
                   -> GROUP BY
                   ->     l_orderkey,
                   ->     o_orderdate,
                   ->     o_shippriority
                   -> ORDER BY
                   ->     revenue DESC,
                   ->     o_orderdate
                   -> LIMIT 20;
l_orderkey |   revenue   | o_orderdate | o_shippriority
------------+-------------+-------------+----------------
  18869634 | 541426.1669 | 1995-01-10  |              0
   2845094 | 450279.8097 | 1995-03-06  |              0
  16716836 | 432402.2306 | 1995-01-28  |              0
  25345699 | 431791.2769 | 1995-02-15  |              0
    568514 | 423439.7864 | 1995-02-18  |              0
   9844418 | 413413.2048 | 1995-02-22  |              0
  12783202 | 412259.8661 | 1995-03-07  |              0
  25342753 | 411832.8838 | 1995-01-07  |              0
   4011108 | 398799.3850 | 1995-03-10  |              0
  28708865 | 398227.3287 | 1995-02-15  |              0
   1000004 | 397918.5426 | 1995-03-02  |              0
  10725381 | 397623.7508 | 1995-03-12  |              0
   4860004 | 394207.2591 | 1995-02-22  |              0
  16002339 | 393607.8484 | 1995-02-06  |              0
   1083941 | 392686.9967 | 1995-02-21  |              0
  24062117 | 391641.4971 | 1995-02-08  |              0
   2529826 | 389595.2070 | 1995-02-17  |              0
  24392391 | 388391.8549 | 1995-02-23  |              0
  14444676 | 387978.3972 | 1995-03-11  |              0
  12935522 | 386509.0814 | 1995-02-16  |              0
(20 rows)

Query 20250328_202645_00009_eqfxa, FINISHED, 2 nodes
Splits: 45 total, 45 done (100.00%)
6.51 [38.6M rows, 337MiB] [5.92M rows/s, 51.8MiB/s]

The following steps demonstrate the changes in column values queried via the EMR Trino terminal before and after performing DML and DDL operations on the tables using the Databricks workspace. Details are provided in the steps below.

Performing DML operations

Let's examine the value of the address from the Trino terminal, before a DML is performed from the Databricks workspace

trino:irc_demo_schema> select c_address from  customers_managed_iceberg where c_custkey = 412446;
             c_address             
-------------------------------------
5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg 
(1 row)

Query 20250404_215551_00012_eqfxa, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.86 [750K rows, 10.9MiB] [402K rows/s, 5.85MiB/s]

The following DML statement is executed successfully from the Databricks workspace to update the customer address.

update databricks_uc_irc_catalog.uc_irc_schema.customers_managed_iceberg
set c_address = 'New customer secondary address for customer # 412446'
where c_custkey = 412446;

Now, the address change is reflected when queried from the Trino terminal.

trino:irc_demo_schema> select c_address from  customers_managed_iceberg where c_custkey = 412446;
                     c_address                      
------------------------------------------------------
New customer secondary address for customer # 412446
(1 row)

Query 20250404_215917_00013_eqfxa, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.01 [750K rows, 8.81MiB] [743K rows/s, 8.72MiB/s]

Performing DDL changes

Table definition is changed from the Databricks workspace, and the change is validated from EMR Trino. Details are provided in the steps below.

Let's examine the table definition (DDL) of orders from the Trino terminal, before a DDL change is performed from the Databricks workspace.

trino:irc_demo_schema> describe orders_managed_iceberg;
    Column      |     Type      | Extra | Comment
-----------------+---------------+-------+---------
o_orderkey      | bigint        |       |        
o_custkey       | bigint        |       |        
o_orderstatus   | varchar       |       |        
o_totalprice    | decimal(18,2) |       |        
o_orderdate     | date          |       |        
o_orderpriority | varchar       |       |        
o_clerk         | varchar       |       |        
o_shippriority  | integer       |       |        
o_comment       | varchar       |       |        
(9 rows)

Query 20250404_220151_00017_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
0.58 [9 rows, 624B] [15 rows/s, 1.05KiB/s]

A DDL statement is executed successfully from the Databricks workspace to add a new column to the orders table.
```
ALTER TABLE databricks_uc_irc_catalog.uc_irc_schema.orders_managed_iceberg
ADD COLUMNS ( free_text STRING);
```

Now, the Trino terminal reflects the table definition change when the orders table is described

trino:irc_demo_schema> describe orders_managed_iceberg;
    Column      |     Type      | Extra | Comment
-----------------+---------------+-------+---------
o_orderkey      | bigint        |       |        
o_custkey       | bigint        |       |        
o_orderstatus   | varchar       |       |        
o_totalprice    | decimal(18,2) |       |        
o_orderdate     | date          |       |        
o_orderpriority | varchar       |       |        
o_clerk         | varchar       |       |        
o_shippriority  | integer       |       |        
o_comment       | varchar       |       |        
free_text       | varchar       |       |        
(10 rows)

Query 20250404_220342_00020_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
0.35 [10 rows, 690B] [28 rows/s, 1.92KiB/s]

Reading an external iceberg table

We also created an external iceberg table from the Databricks workspace and read it from the Trino terminal

trino:irc_demo_schema> select c_custkey, c_name, c_address,c_phone from customers_external_iceberg limit 5;
c_custkey |       c_name       |              c_address               |     c_phone    
-----------+--------------------+--------------------------------------+-----------------
   412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf                 | 31-421-403-4333
   412446 | Customer#000412446 | 5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg   | 30-487-949-7942
   412447 | Customer#000412447 | HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO    | 17-797-466-6308
   412448 | Customer#000412448 | hJok1MMrDgH                          | 16-541-510-4964
   412449 | Customer#000412449 | zAt1nZNG01gOhIqgyDtDa S,Y0VSofZJs1dd | 24-710-983-5536
(5 rows)

Query 20250404_221013_00023_eqfxa, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
1.15 [7 rows, 22.4MiB] [6 rows/s, 19.5MiB/s]

Inserting a record into the managed Iceberg table

trino> use databricks_uc_irc_catalog.uc_irc_schema;
USE
trino:uc_irc_schema> select * from customers_managed_iceberg where c_custkey = 412445;
 c_custkey |       c_name       |      c_address       | c_nationkey |     c_phone     | c_acctbal | c_mktsegment |         >
-----------+--------------------+----------------------+-------------+-----------------+-----------+--------------+--------->
    412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf |          21 | 31-421-403-4333 |   5358.33 | BUILDING     | carefully>
(1 row)

Query 20250616_140408_00008_eupkh, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.04 [750K rows, 25.7MiB] [368K rows/s, 12.6MiB/s]

trino:uc_irc_schema> insert into customers_managed_iceberg values
                  -> (412445,   'Customer#000412445',   '0QAB3OjYnbP6mA0B,kgf', 21, '31-421-403-4333',  5358.33,    'BUILDING', 'carefully blithely regular epi');
INSERT: 1 row

Query 20250616_140558_00011_eupkh, FINISHED, 2 nodes
Splits: 10 total, 10 done (100.00%)
5.43 [0 rows, 0B] [0 rows/s, 0B/s]

trino:uc_irc_schema> select * from customers_managed_iceberg where c_custkey = 412445;
 c_custkey |       c_name       |      c_address       | c_nationkey |     c_phone     | c_acctbal | c_mktsegment |         >
-----------+--------------------+----------------------+-------------+-----------------+-----------+--------------+--------->
    412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf |          21 | 31-421-403-4333 |   5358.33 | BUILDING     | carefully>
    412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf |          21 | 31-421-403-4333 |   5358.33 | BUILDING     | carefully>
(2 rows)
(END)

Step 5: EMR Spark SQL - Read and Write Managed Iceberg tables

Now that EMR Spark is configured to read tables from the Databricks UC catalog, you can perform SQL queries using the Spark SQL terminal. We will use standard ANSI SQL queries that perform aggregations or join tables to retrieve data. We will also insert data into the managed iceberg table.

Adding Spark SQL CLI setting for better visibility of query results.

spark-sql ()> SET spark.sql.cli.print.header=true; 
key     value
spark.sql.cli.print.header      true
Time taken: 0.039 seconds, Fetched 1 row(s)
spark-sql ()> SET spark.sql.cli.output.format=table; 
key     value
spark.sql.cli.output.format     table
Time taken: 0.023 seconds, Fetched 1 row(s)
spark-sql ()> SHOW CATALOGS;
catalog
databricks_uc_irc_catalog
spark_catalog
Time taken: 0.031 seconds, Fetched 2 row(s)

spark-sql ()> USE databricks_uc_irc_catalog;
Response code
Time taken: 0.031 seconds

spark-sql ()> SHOW SCHEMAS;
namespace
default
information_schema
uc_irc_schema
Time taken: 0.887 seconds, Fetched 3 row(s)


spark-sql ()> USE uc_irc_schema;
Response code
Time taken: 0.249 seconds
spark-sql (uc_irc_schema)>


spark-sql (uc_irc_schema)> show tables;
namespace	tableName	isTemporary
customer
customers_external_iceberg
customers_managed_iceberg
lineitem
lineitem_managed_iceberg
nation
orders
orders_managed_iceberg
part
partsupp
region
supplier
Time taken: 0.286 seconds, Fetched 12 row(s)

Reading data from a managed Iceberg table

spark-sql (uc_irc_schema)> select * from customers_managed_iceberg limit 3;
c_custkey	c_name	c_address	c_nationkey	c_phone	c_acctbal	c_mktsegment	c_comment
412445	Customer#000412445	0QAB3OjYnbP6mA0B,kgf	21	31-421-403-4333	5358.33	BUILDING	arefully blithely regular epi
412446	Customer#000412446	5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg 	20	30-487-949-7942	9441.59	MACHINERYsleep according to the fluffily even forges. fluffily careful packages after the ironic, silent deposi
412447	Customer#000412447	HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO	7	17-797-466-6308	7868.75	AUTOMOBILE	aggle blithely among the carefully express excus
Time taken: 3.125 seconds, Fetched 3 row(s)
spark-sql (uc_irc_schema)>

Inserting a record into the managed Iceberg table from Spark SQL.

spark-sql ()> use uc_irc_schema;
Time taken: 0.313 seconds
spark-sql (uc_irc_schema)> select count(*) from customers_managed_iceberg;
count(1)
750002
Time taken: 1.483 seconds, Fetched 1 row(s)
spark-sql (uc_irc_schema)> insert into customers_managed_iceberg values (412445,'Customer#000412445','0QAB3OjYnbP6mA0B,kgf',21,'31-421-403-4333',5358.33,'BUILDING','arefully blithely regular epi');
Response code
Time taken: 4.425 seconds
spark-sql (uc_irc_schema)> select count(*) from customers_managed_iceberg;
count(1)
750003
Time taken: 0.757 seconds, Fetched 1 row(s)
spark-sql (uc_irc_schema)>

Step 6: Performing UC access control test

Permissions in Unity Catalog (UC) play a critical role in controlling access to the data assets governed by UC. EMR Spark and Trino's ability to query the affected tables is immediately impacted if permissions are revoked from a principal or user associated with the PAT token. For example, revoking SELECT privilege on a table will result in query failure error messages as below. This highlights the importance of carefully managing permissions in Unity Catalog to balance security and operational efficiency.

In these examples below, we first show queries running successfully from both Trino and Spark, when the databricks principal has Select grants on the table (customer_managed_iceberg) being queried. Later, we revoke the principal’s Select permission on the table and show the failures for both Trino and Spark.

When the principal has SELECT granted on the table being queried

Table access is successful in EMR Trino

trino> use databricks_uc_irc_catalog.uc_irc_schema;
USE
trino:uc_irc_schema> select * from customers_managed_iceberg limit 1;
 c_custkey |       c_name       |      c_address       | c_nationkey |     c_phone     | c_acctbal | c_mktsegment |    >
-----------+--------------------+----------------------+-------------+-----------------+-----------+--------------+---->
    412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf |          21 | 31-421-403-4333 |   5358.33 | BUILDING     | are>
(1 row)
(END)

Table access is successful in EMR Spark

spark-sql (uc_irc_schema)> select * from customers_managed_iceberg limit 1;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
412445	Customer#000412445	0QAB3OjYnbP6mA0B,kgf	21	31-421-403-4333	5358.33	BUILDING	arefully blithely regular epi
Time taken: 5.107 seconds, Fetched 1 row(s)

Let's remove SELECT permission on the customer_managed_iceberg table from the principal

Now that the SELECT grant is missing on the table, we received the following error when projecting all the columns.

Table access fails in EMR Trino

trino:uc_irc_schema> select * from customers_managed_iceberg where c_custkey = 412445;
Query 20250616_140734_00013_eupkh failed: Failed to load table: customers_managed_iceberg in uc_irc_schema namespace

trino:uc_irc_schema> select * from customers_managed_iceberg limit 1;
Query 20250616_140754_00014_eupkh failed: Failed to load table: customers_managed_iceberg in uc_irc_schema namespace

Table access fails in EMR Spark

spark-sql (uc_irc_schema)> select * from customers_managed_iceberg limit 1;
25/06/16 14:18:31 ERROR SparkSQLDriver: Failed in [select * from customers_managed_iceberg limit 1]
org.apache.iceberg.exceptions.ServiceFailureException: Server error: JsonMappingException: No content to map due to end-of-input
 at [Source: ; line: 1, column: 1]
	at org.apache.iceberg.rest.ErrorHandlers$DefaultErrorHandler.accept(ErrorHandlers.java:217) ~[org.apache.iceberg_iceberg-spark-runtime-3.5_2.12-1.6.1.jar:?]
	at org.apache.iceberg.rest.ErrorHandlers$TableErrorHandler.accept(ErrorHandlers.java:118) ~[org.apache.iceberg_iceberg-spark-runtime-3.5_2.12-1.6.1.jar:?]
	at org.apache.iceberg.rest.ErrorHandlers$TableErrorHandler.accept(ErrorHandlers.java:102) ~[org.apache.iceberg_iceberg-spark-runtime-3.5_2.12-1.6.1.jar:?]

This is expected as we have revoked the SELECT grant on the table from the principal, demonstrating how UC credential vending simplifies governance and access control across Databricks and external engines.

Step 7: Using OAuth for connecting to the Unity Catalog

You can use Machine-to-Machine (M2M) OAuth protocol to authenticate and authorize access to Unity Catalog assets from OSS Spark using the following Spark-SQL prompt. Note that the spark.sql.catalog.databricks_uc_irc_catalog.credential is set using<<Client ID>> and <<Client Secret>>. To learn more about how to create the Client ID and Client Secret, you can refer here.

./spark-sql --name "uc-test-oauth-process" \
    --master "local[*]" \
    --packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,org.apache.hadoop:hadoop-common:3.4.1,org.apache.hadoop:hadoop-aws:3.4.1" \
    --conf "spark.hadoop.fs.s3.impl=org.apache.iceberg.aws.s3.S3FileIO" \
    --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog=org.apache.iceberg.spark.SparkCatalog" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog.catalog-impl=org.apache.iceberg.rest.RESTCatalog" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog.uri=https://<<Your Workspace URL>>/api/2.1/unity-catalog/iceberg-rest" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog.oauth2-server-uri=https://<<Your Workspace URL>>/oidc/v1/token" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog.scope=all-apis" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog.credential=<<Client ID>>:<<Client Secret>>" \
    --conf "spark.sql.defaultCatalog=databricks_uc_irc_catalog" \
    --conf "spark.sql.catalog.databricks_uc_irc_catalog.warehouse=databricks_uc_irc_catalog"

Conclusion

This blog provided a step-by-step guide on using EMR-based Trino and Spark to securely read and write Iceberg tables registered in Databricks Unity Catalog using the Iceberg REST APIs. By following this approach, you can seamlessly integrate Spark and Trino with the Unity Catalog, enabling interoperability across data platforms while maintaining strong governance and security controls and benefiting from performance enhancements like Liquid Clustering and Predictive Optimization. This ensures that Iceberg tables remain accessible to external engines without compromising data consistency, performance, or compliance.

Try out Unity Catalog’s Iceberg REST APIs today to access your data securely from any external Iceberg REST API-compatible clients. Additionally, you can try performing CRUD operations on your Databricks Unity Catalog data from several external systems. For more details, please refer here.

What’s Next?

Follow the recent announcements about full Apache Iceberg™ support in Databricks. Also, check the Databricks Community blogs to learn how to integrate OSS Apache Spark™ and OSS Trino with your Databricks Unity Catalog assets via open APIs.

Performing DML Operations on Databricks Unity Catalog from AWS EMR