trino:uc_irc_schema> SHOW TABLES;
Table
----------
customers_managed_iceberg
lineitem_managed_iceberg
orders_managed_iceberg
customers_external_iceberg
(4 rows)
Query 20250328_202517_00007_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
0.38 [3 rows, 97B] [7 rows/s, 256B/s]
trino:uc_irc_schema> SELECT
-> l_returnflag,
-> l_linestatus,
-> SUM(l_quantity) AS sum_qty,
-> SUM(l_extendedprice) AS sum_base_price,
-> SUM(l_extendedprice * (1 - l_discount)) AS sum_disc_price,
-> SUM(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
-> AVG(l_quantity) AS avg_qty,
-> AVG(l_extendedprice) AS avg_price,
-> AVG(l_discount) AS avg_disc,
-> COUNT(*) AS count_order
-> FROM
-> lineitem_managed_iceberg
-> WHERE
-> l_shipdate <= DATE '1998-12-01' - INTERVAL '90' DAY
-> GROUP BY
-> l_returnflag,
-> l_linestatus
-> ORDER BY
-> l_returnflag,
-> l_linestatus;
l_returnflag | l_linestatus | sum_qty | sum_base_price | sum_disc_price | sum_charge | avg_qty | avg_price | avg_disc | count_order
--------------+--------------+--------------+-----------------+-------------------+---------------------+---------+-----------+----------+-------------
A | F | 188818373.00 | 283107483036.12 | 268952035589.0630 | 279714361804.228122 | 25.50 | 38237.67 | 0.05 | 7403889
N | F | 4913382.00 | 7364213967.95 | 6995782725.6633 | 7275821143.989585 | 25.53 | 38267.78 | 0.05 | 192439
N | O | 371626663.00 | 557251817321.64 | 529391298998.6177 | 550573649707.749900 | 25.50 | 38233.71 | 0.05 | 14574883
R | F | 188960009.00 | 283310887148.20 | 269147687267.2029 | 279912972474.864338 | 25.51 | 38252.41 | 0.05 | 7406353
(4 rows)
Query 20250328_202610_00008_eqfxa, FINISHED, 2 nodes
Splits: 25 total, 25 done (100.00%)
8.10 [30M rows, 253MiB] [3.7M rows/s, 31.2MiB/s]
trino:uc_irc_schema> SELECT
-> l_orderkey,
-> SUM(l_extendedprice * (1 - l_discount)) AS revenue,
-> o_orderdate,
-> o_shippriority
-> FROM
-> customers_managed_iceberg c
-> JOIN orders_managed_iceberg o ON c.c_custkey = o.o_custkey
-> JOIN lineitem_managed_iceberg l ON l.l_orderkey = o.o_orderkey
-> WHERE
-> c.c_mktsegment = 'BUILDING'
-> AND o.o_orderdate < DATE '1995-03-15'
-> AND l.l_shipdate > DATE '1995-03-15'
-> GROUP BY
-> l_orderkey,
-> o_orderdate,
-> o_shippriority
-> ORDER BY
-> revenue DESC,
-> o_orderdate
-> LIMIT 20;
l_orderkey | revenue | o_orderdate | o_shippriority
------------+-------------+-------------+----------------
18869634 | 541426.1669 | 1995-01-10 | 0
2845094 | 450279.8097 | 1995-03-06 | 0
16716836 | 432402.2306 | 1995-01-28 | 0
25345699 | 431791.2769 | 1995-02-15 | 0
568514 | 423439.7864 | 1995-02-18 | 0
9844418 | 413413.2048 | 1995-02-22 | 0
12783202 | 412259.8661 | 1995-03-07 | 0
25342753 | 411832.8838 | 1995-01-07 | 0
4011108 | 398799.3850 | 1995-03-10 | 0
28708865 | 398227.3287 | 1995-02-15 | 0
1000004 | 397918.5426 | 1995-03-02 | 0
10725381 | 397623.7508 | 1995-03-12 | 0
4860004 | 394207.2591 | 1995-02-22 | 0
16002339 | 393607.8484 | 1995-02-06 | 0
1083941 | 392686.9967 | 1995-02-21 | 0
24062117 | 391641.4971 | 1995-02-08 | 0
2529826 | 389595.2070 | 1995-02-17 | 0
24392391 | 388391.8549 | 1995-02-23 | 0
14444676 | 387978.3972 | 1995-03-11 | 0
12935522 | 386509.0814 | 1995-02-16 | 0
(20 rows)
Query 20250328_202645_00009_eqfxa, FINISHED, 2 nodes
Splits: 45 total, 45 done (100.00%)
6.51 [38.6M rows, 337MiB] [5.92M rows/s, 51.8MiB/s]
The following steps demonstrate how column values and table definitions queried from the EMR Trino terminal change before and after DML and DDL operations are performed on the tables from the Databricks workspace.
Performing DML operations
- Let's examine the value of the customer address from the Trino terminal before any DML is performed from the Databricks workspace.
trino:irc_demo_schema> select c_address from customers_managed_iceberg where c_custkey = 412446;
c_address
-------------------------------------
5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg
(1 row)
Query 20250404_215551_00012_eqfxa, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.86 [750K rows, 10.9MiB] [402K rows/s, 5.85MiB/s]
- The following DML statement is executed successfully from the Databricks workspace to update the customer address.
update databricks_uc_irc_catalog.uc_irc_schema.customers_managed_iceberg
set c_address = 'New customer secondary address for customer # 412446'
where c_custkey = 412446;
- Now, the address change is reflected when queried from the Trino terminal.
trino:irc_demo_schema> select c_address from customers_managed_iceberg where c_custkey = 412446;
c_address
------------------------------------------------------
New customer secondary address for customer # 412446
(1 row)
Query 20250404_215917_00013_eqfxa, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.01 [750K rows, 8.81MiB] [743K rows/s, 8.72MiB/s]
Performing DDL changes
- The table definition is changed from the Databricks workspace, and the change is validated from EMR Trino, as shown in the steps below.
- Let's examine the table definition (DDL) of the orders table from the Trino terminal before a DDL change is performed from the Databricks workspace.
trino:irc_demo_schema> describe orders_managed_iceberg;
Column | Type | Extra | Comment
-----------------+---------------+-------+---------
o_orderkey | bigint | |
o_custkey | bigint | |
o_orderstatus | varchar | |
o_totalprice | decimal(18,2) | |
o_orderdate | date | |
o_orderpriority | varchar | |
o_clerk | varchar | |
o_shippriority | integer | |
o_comment | varchar | |
(9 rows)
Query 20250404_220151_00017_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
0.58 [9 rows, 624B] [15 rows/s, 1.05KiB/s]
- A DDL statement is executed successfully from the Databricks workspace to add a new column to the orders table.
ALTER TABLE databricks_uc_irc_catalog.uc_irc_schema.orders_managed_iceberg
ADD COLUMNS (free_text STRING);
- Now, the Trino terminal reflects the table definition change when the orders table is described.
trino:irc_demo_schema> describe orders_managed_iceberg;
Column | Type | Extra | Comment
-----------------+---------------+-------+---------
o_orderkey | bigint | |
o_custkey | bigint | |
o_orderstatus | varchar | |
o_totalprice | decimal(18,2) | |
o_orderdate | date | |
o_orderpriority | varchar | |
o_clerk | varchar | |
o_shippriority | integer | |
o_comment | varchar | |
free_text | varchar | |
(10 rows)
Query 20250404_220342_00020_eqfxa, FINISHED, 3 nodes
Splits: 8 total, 8 done (100.00%)
0.35 [10 rows, 690B] [28 rows/s, 1.92KiB/s]
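- Besides DESCRIBE, Trino's SHOW CREATE TABLE statement is another way to inspect the full table definition, including connector and table properties (output omitted here, as it varies by environment):
trino:irc_demo_schema> show create table orders_managed_iceberg;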
Reading an external Iceberg table
- We also created an external Iceberg table from the Databricks workspace and read it from the Trino terminal.
trino:irc_demo_schema> select c_custkey, c_name, c_address,c_phone from customers_external_iceberg limit 5;
c_custkey | c_name | c_address | c_phone
-----------+--------------------+--------------------------------------+-----------------
412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 31-421-403-4333
412446 | Customer#000412446 | 5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg | 30-487-949-7942
412447 | Customer#000412447 | HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO | 17-797-466-6308
412448 | Customer#000412448 | hJok1MMrDgH | 16-541-510-4964
412449 | Customer#000412449 | zAt1nZNG01gOhIqgyDtDa S,Y0VSofZJs1dd | 24-710-983-5536
(5 rows)
Query 20250404_221013_00023_eqfxa, FINISHED, 1 node
Splits: 4 total, 4 done (100.00%)
1.15 [7 rows, 22.4MiB] [6 rows/s, 19.5MiB/s]
Inserting a record into the managed Iceberg table from Trino
trino> use databricks_uc_irc_catalog.uc_irc_schema;
USE
trino:uc_irc_schema> select * from customers_managed_iceberg where c_custkey = 412445;
c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | >
-----------+--------------------+----------------------+-------------+-----------------+-----------+--------------+--------->
412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | carefully>
(1 row)
Query 20250616_140408_00008_eupkh, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
2.04 [750K rows, 25.7MiB] [368K rows/s, 12.6MiB/s]
trino:uc_irc_schema> insert into customers_managed_iceberg values
-> (412445, 'Customer#000412445', '0QAB3OjYnbP6mA0B,kgf', 21, '31-421-403-4333', 5358.33, 'BUILDING', 'carefully blithely regular epi');
INSERT: 1 row
Query 20250616_140558_00011_eupkh, FINISHED, 2 nodes
Splits: 10 total, 10 done (100.00%)
5.43 [0 rows, 0B] [0 rows/s, 0B/s]
trino:uc_irc_schema> select * from customers_managed_iceberg where c_custkey = 412445;
c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | >
-----------+--------------------+----------------------+-------------+-----------------+-----------+--------------+--------->
412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | carefully>
412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | carefully>
(2 rows)
(END)
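Note that the table now contains two rows with the same c_custkey: Iceberg does not enforce primary-key uniqueness, so the INSERT simply appended a duplicate record. If upsert semantics are needed, a MERGE statement is one option; recent versions of the Trino Iceberg connector support MERGE, though you should verify the behavior in your environment. A minimal sketch using the same values as above:
-- upsert: update the address if the key already exists, insert a new row otherwise
merge into customers_managed_iceberg t
using (
    values (412445, 'Customer#000412445', '0QAB3OjYnbP6mA0B,kgf', 21,
            '31-421-403-4333', 5358.33, 'BUILDING', 'carefully blithely regular epi')
) as s (c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment)
on t.c_custkey = s.c_custkey
when matched then update set c_address = s.c_address
when not matched then insert values (s.c_custkey, s.c_name, s.c_address, s.c_nationkey,
                                     s.c_phone, s.c_acctbal, s.c_mktsegment, s.c_comment);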
Step 5: EMR Spark SQL - Read and Write Managed Iceberg tables
Now that EMR Spark is configured to read tables from the Databricks UC catalog, you can run SQL queries from the Spark SQL terminal. We will use standard ANSI SQL queries that perform aggregations and joins to retrieve data, and we will also insert data into a managed Iceberg table.
Adding Spark SQL CLI settings for better visibility of query results.
spark-sql ()> SET spark.sql.cli.print.header=true;
key value
spark.sql.cli.print.header true
Time taken: 0.039 seconds, Fetched 1 row(s)
spark-sql ()> SET spark.sql.cli.output.format=table;
key value
spark.sql.cli.output.format table
Time taken: 0.023 seconds, Fetched 1 row(s)
spark-sql ()> SHOW CATALOGS;
catalog
databricks_uc_irc_catalog
spark_catalog
Time taken: 0.031 seconds, Fetched 2 row(s)
spark-sql ()> USE databricks_uc_irc_catalog;
Response code
Time taken: 0.031 seconds
spark-sql ()> SHOW SCHEMAS;
namespace
default
information_schema
uc_irc_schema
Time taken: 0.887 seconds, Fetched 3 row(s)
spark-sql ()> USE uc_irc_schema;
Response code
Time taken: 0.249 seconds
spark-sql (uc_irc_schema)>
spark-sql (uc_irc_schema)> show tables;
namespace tableName isTemporary
customer
customers_external_iceberg
customers_managed_iceberg
lineitem
lineitem_managed_iceberg
nation
orders
orders_managed_iceberg
part
partsupp
region
supplier
Time taken: 0.286 seconds, Fetched 12 row(s)
Reading data from a managed Iceberg table
spark-sql (uc_irc_schema)> select * from customers_managed_iceberg limit 3;
c_custkey c_name c_address c_nationkey c_phone c_acctbal c_mktsegment c_comment
412445 Customer#000412445 0QAB3OjYnbP6mA0B,kgf 21 31-421-403-4333 5358.33 BUILDING arefully blithely regular epi
412446 Customer#000412446 5u8MSbyiC7J,7PuY4Ivaq1JRbTCMKeNVqg 20 30-487-949-7942 9441.59 MACHINERYsleep according to the fluffily even forges. fluffily careful packages after the ironic, silent deposi
412447 Customer#000412447 HC4ZT62gKPgrjr ceoaZgFOunlUogr7GO 7 17-797-466-6308 7868.75 AUTOMOBILE aggle blithely among the carefully express excus
Time taken: 3.125 seconds, Fetched 3 row(s)
spark-sql (uc_irc_schema)>
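The aggregation and join queries shown earlier from the Trino terminal work unchanged here; for example, the same top-revenue join query can be run from the Spark SQL prompt (a sketch against the same tables):
select
    l_orderkey,
    sum(l_extendedprice * (1 - l_discount)) as revenue,
    o_orderdate,
    o_shippriority
from customers_managed_iceberg c
join orders_managed_iceberg o on c.c_custkey = o.o_custkey
join lineitem_managed_iceberg l on l.l_orderkey = o.o_orderkey
where c.c_mktsegment = 'BUILDING'
  and o.o_orderdate < date '1995-03-15'
  and l.l_shipdate > date '1995-03-15'
group by l_orderkey, o_orderdate, o_shippriority
order by revenue desc, o_orderdate
limit 20;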
Inserting a record into the managed Iceberg table from Spark SQL.
spark-sql ()> use uc_irc_schema;
Time taken: 0.313 seconds
spark-sql (uc_irc_schema)> select count(*) from customers_managed_iceberg;
count(1)
750002
Time taken: 1.483 seconds, Fetched 1 row(s)
spark-sql (uc_irc_schema)> insert into customers_managed_iceberg values (412445,'Customer#000412445','0QAB3OjYnbP6mA0B,kgf',21,'31-421-403-4333',5358.33,'BUILDING','arefully blithely regular epi');
Response code
Time taken: 4.425 seconds
spark-sql (uc_irc_schema)> select count(*) from customers_managed_iceberg;
count(1)
750003
Time taken: 0.757 seconds, Fetched 1 row(s)
spark-sql (uc_irc_schema)>
Step 6: Performing a UC access control test
Permissions in Unity Catalog (UC) play a critical role in controlling access to the data assets governed by UC. If permissions are revoked from the principal or user associated with the PAT token, EMR Spark's and Trino's ability to query the affected tables is immediately impacted. For example, revoking the SELECT privilege on a table results in query failures with error messages like those shown below. This highlights the importance of carefully managing permissions in Unity Catalog to balance security and operational efficiency.
In the examples below, we first show queries running successfully from both Trino and Spark when the Databricks principal has the SELECT grant on the table being queried (customers_managed_iceberg). Later, we revoke the principal's SELECT permission on the table and show the resulting failures in both Trino and Spark.
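For reference, grants are managed from the Databricks side; a sketch of the corresponding GRANT statement, where <<principal>> is a placeholder for your service principal or user:
-- Run from a Databricks SQL editor or notebook
GRANT SELECT ON TABLE databricks_uc_irc_catalog.uc_irc_schema.customers_managed_iceberg TO `<<principal>>`;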
When the principal has SELECT granted on the table being queried
Table access is successful in EMR Trino
trino> use databricks_uc_irc_catalog.uc_irc_schema;
USE
trino:uc_irc_schema> select * from customers_managed_iceberg limit 1;
c_custkey | c_name | c_address | c_nationkey | c_phone | c_acctbal | c_mktsegment | >
-----------+--------------------+----------------------+-------------+-----------------+-----------+--------------+---->
412445 | Customer#000412445 | 0QAB3OjYnbP6mA0B,kgf | 21 | 31-421-403-4333 | 5358.33 | BUILDING | are>
(1 row)
(END)
Table access is successful in EMR Spark
spark-sql (uc_irc_schema)> select * from customers_managed_iceberg limit 1;
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
412445 Customer#000412445 0QAB3OjYnbP6mA0B,kgf 21 31-421-403-4333 5358.33 BUILDING arefully blithely regular epi
Time taken: 5.107 seconds, Fetched 1 row(s)
Let's remove the SELECT permission on the customers_managed_iceberg table from the principal.
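A sketch of the corresponding REVOKE statement, run from the Databricks side (<<principal>> is again a placeholder):
-- Run from a Databricks SQL editor or notebook
REVOKE SELECT ON TABLE databricks_uc_irc_catalog.uc_irc_schema.customers_managed_iceberg FROM `<<principal>>`;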
Now that the SELECT grant is missing on the table, we receive the following errors when querying it.
Table access fails in EMR Trino
trino:uc_irc_schema> select * from customers_managed_iceberg where c_custkey = 412445;
Query 20250616_140734_00013_eupkh failed: Failed to load table: customers_managed_iceberg in uc_irc_schema namespace
trino:uc_irc_schema> select * from customers_managed_iceberg limit 1;
Query 20250616_140754_00014_eupkh failed: Failed to load table: customers_managed_iceberg in uc_irc_schema namespace
Table access fails in EMR Spark
spark-sql (uc_irc_schema)> select * from customers_managed_iceberg limit 1;
25/06/16 14:18:31 ERROR SparkSQLDriver: Failed in [select * from customers_managed_iceberg limit 1]
org.apache.iceberg.exceptions.ServiceFailureException: Server error: JsonMappingException: No content to map due to end-of-input
at [Source: ; line: 1, column: 1]
at org.apache.iceberg.rest.ErrorHandlers$DefaultErrorHandler.accept(ErrorHandlers.java:217) ~[org.apache.iceberg_iceberg-spark-runtime-3.5_2.12-1.6.1.jar:?]
at org.apache.iceberg.rest.ErrorHandlers$TableErrorHandler.accept(ErrorHandlers.java:118) ~[org.apache.iceberg_iceberg-spark-runtime-3.5_2.12-1.6.1.jar:?]
at org.apache.iceberg.rest.ErrorHandlers$TableErrorHandler.accept(ErrorHandlers.java:102) ~[org.apache.iceberg_iceberg-spark-runtime-3.5_2.12-1.6.1.jar:?]
This is expected as we have revoked the SELECT grant on the table from the principal, demonstrating how UC credential vending simplifies governance and access control across Databricks and external engines.
Step 7: Using OAuth to connect to Unity Catalog
You can use the Machine-to-Machine (M2M) OAuth protocol to authenticate and authorize access to Unity Catalog assets from OSS Spark using the following spark-sql invocation. Note that spark.sql.catalog.databricks_uc_irc_catalog.credential is set using <<Client ID>> and <<Client Secret>>. To learn more about how to create the Client ID and Client Secret, you can refer here.
./spark-sql --name "uc-test-oauth-process" \
--master "local[*]" \
--packages "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,org.apache.hadoop:hadoop-common:3.4.1,org.apache.hadoop:hadoop-aws:3.4.1" \
--conf "spark.hadoop.fs.s3.impl=org.apache.iceberg.aws.s3.S3FileIO" \
--conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog=org.apache.iceberg.spark.SparkCatalog" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog.catalog-impl=org.apache.iceberg.rest.RESTCatalog" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog.uri=https://<<Your Workspace URL>>/api/2.1/unity-catalog/iceberg-rest" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog.oauth2-server-uri=https://<<Your Workspace URL>>/oidc/v1/token" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog.scope=all-apis" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog.credential=<<Client ID>>:<<Client Secret>>" \
--conf "spark.sql.defaultCatalog=databricks_uc_irc_catalog" \
--conf "spark.sql.catalog.databricks_uc_irc_catalog.warehouse=databricks_uc_irc_catalog"
Conclusion
This blog provided a step-by-step guide on using EMR-based Trino and Spark to securely read and write Iceberg tables registered in Databricks Unity Catalog using the Iceberg REST APIs. By following this approach, you can seamlessly integrate Spark and Trino with Unity Catalog, enabling interoperability across data platforms while maintaining strong governance and security controls, and benefiting from performance enhancements like Liquid Clustering and Predictive Optimization. This ensures that Iceberg tables remain accessible to external engines without compromising data consistency, performance, or compliance.
Try out Unity Catalog's Iceberg REST APIs today to access your data securely from any external Iceberg REST API-compatible client. Additionally, you can try performing CRUD operations on your Databricks Unity Catalog data from several external systems. For more details, please refer here.
What’s Next?
Follow the recent announcements about full Apache Iceberg™ support in Databricks. Also, check the Databricks Community blogs to learn how to integrate OSS Apache Spark™ and OSS Trino with your Databricks Unity Catalog assets via open APIs.