anhhchu, Databricks Employee

Data practitioners often struggle to choose between open data formats like Apache Iceberg and Delta Lake (Linux Foundation) due to inconsistent support across data platforms. To address this, Databricks introduced UniForm in 2023 and acquired Tabular (founded by the original creators of Iceberg) in 2024. UniForm enables interoperability between open data formats (Delta, Iceberg, and Hudi), allowing data practitioners to access table data in cloud object storage (S3, ADLS, GCS) from whichever platform they choose, while keeping it governed by Unity Catalog.

This article explores how Snowflake can access Delta tables governed by Unity Catalog using the three methods below:

  1. Read Iceberg with Unity Catalog credential vending and no external volume: available for managed and external UC tables on AWS, but only managed UC tables on Azure and GCP
  2. Read Iceberg with Unity Catalog credential vending and an external volume: managed and external UC tables on all clouds
  3. Read Delta as an external Iceberg table with an object storage catalog integration: managed and external UC tables on all clouds

Enabling Iceberg Reads in Databricks

Because UniForm is only a feature that enables format interoperability, Delta tables with UniForm enabled will be referred to as tables with Iceberg reads for the remainder of this article.

In the Databricks workspace, Iceberg reads can be enabled on an existing Delta table or on a new Delta table in Unity Catalog. This is a 2-step process:

  1. Enable Iceberg reads on Delta table
  2. Set up Unity Catalog for Iceberg REST API

Requirements

Step 1: Enable Iceberg reads on a Delta table

Create a Delta Table with Iceberg reads

In Databricks, we can enable Iceberg reads on a Unity Catalog managed or external table using table properties:

create or replace table ahc_demos.tpcds_sf_100.store_sales
tblproperties(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg')
as select * from tpcds.sf_100.store_sales

After the table is created, Iceberg metadata is generated alongside the Delta metadata, which allows the table to be read by both Delta and Iceberg readers. The location of the Iceberg metadata JSON file can be found by running DESCRIBE EXTENDED <table_name> or by viewing the table details in Catalog Explorer. Similar to Delta metadata, Iceberg uses this metadata file to store the table schema, partition information, and the snapshot details used for table versioning.

[Screenshot: Iceberg metadata location shown in the table details in Catalog Explorer]
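For example, both checks can be run from a notebook or SQL editor (a quick sanity check on the store_sales table created above; the exact fields in the output can vary across Databricks Runtime versions):

-- confirm Iceberg reads (UniForm) are enabled on the table
SHOW TBLPROPERTIES ahc_demos.tpcds_sf_100.store_sales ('delta.universalFormat.enabledFormats');

-- the Iceberg metadata location is reported in the extended table information
DESCRIBE EXTENDED ahc_demos.tpcds_sf_100.store_sales;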

In the cloud storage location where the table is saved, in addition to the _delta_log folder containing the Delta metadata, there is also a metadata folder containing the Iceberg metadata. This folder holds multiple versions of JSON metadata files as well as Avro manifest files. The Iceberg metadata JSON files follow the format <version>-<uuid>-metadata.json, where the latest version represents the most recent state of the table. The Avro manifest files store the list of data files along with column-level metrics, partition information, schema, and stats; Iceberg uses them for file pruning and for table versioning (time travel).

[Screenshot: the _delta_log and metadata folders in the table's cloud storage location]

Data files are compatible with both Delta and Iceberg formats. One difference: while regular Delta tables use Snappy-compressed Parquet files, tables with Iceberg reads use Zstandard (Zstd) compression for Iceberg compatibility. Zstd also helps with storage cost, as it typically achieves a higher compression ratio and therefore smaller file sizes. Note that if you add the Iceberg table properties after the table has been created, existing data files won't be rewritten from Snappy to Zstd; only files written after the properties are enabled will use Zstd.

[Screenshot: Zstd-compressed Parquet data files in cloud storage]

Improving Delta Read Performance with Liquid Clustering

Liquid clustering is a Delta feature that doesn't influence Iceberg read performance while still providing a performance boost for Delta readers. Tables with Iceberg reads do not support deletion vectors (this will change with Iceberg v3), and as a result, deletion vectors are not auto-enabled with liquid clustering.

The rule of thumb is to choose columns that are frequently used as filter or join keys as the clustering columns (s_customer_sk and s_sold_date_sk in this example).

After that, run OPTIMIZE to trigger clustering; this step is required to ensure an optimal data layout according to the clustering keys:

alter table ahc_demos.tpcds_sf_100.store_sales
cluster by (s_customer_sk, s_sold_date_sk);

-- run optimize to trigger clustering
optimize ahc_demos.tpcds_sf_100.store_sales;
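Alternatively, the clustering keys and the Iceberg reads properties can be declared together at creation time, so no separate ALTER TABLE is needed. A sketch based on the same example table:

-- create the table with liquid clustering and Iceberg reads enabled up front
create or replace table ahc_demos.tpcds_sf_100.store_sales
cluster by (s_customer_sk, s_sold_date_sk)
tblproperties(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg')
as select * from tpcds.sf_100.store_sales;

-- OPTIMIZE still maintains the clustering layout as new data is written
optimize ahc_demos.tpcds_sf_100.store_sales;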

 

Enable Iceberg Reads on DLT tables

Previously, you could not enable Iceberg reads on tables created in DLT pipelines. However, the recently released DLT sink API (currently in Public Preview) allows you to write to tables outside the pipeline instead of DLT-managed tables. This means you can enable additional features and properties on the target table outside of the DLT pipeline, including the Iceberg reads table properties.

This feature is particularly useful when you want to build the ETL pipeline in Databricks and then enable Iceberg reads on the gold layer for downstream serving in external systems such as Snowflake.

To use it, define a Delta sink in the DLT pipeline (see the DLT sink API). The gold table will be a managed UC table referenced by its fully qualified name.

import dlt
from pyspark.sql.functions import expr

dlt.create_sink(
  name = "clickstream_dlt",
  format = "delta",
  options = {"tableName": "ahc_demos.streaming.clickstream_gold"}
)

@dlt.append_flow(name = "clickstream_gold_flow", target = "clickstream_dlt")
def spark_referrers():
  return (
    spark.readStream.table("clickstream_clean")
      .filter(expr("current_page_title == 'Apache_Spark'"))
      .withColumnRenamed("previous_page_title", "referrer")
      .select("referrer", "current_page_id", "current_page_title", "click_count")
  )

[Screenshot: DLT pipeline with the clickstream_gold sink]

After that, you can enable Iceberg reads on the table by executing an ALTER TABLE statement. Iceberg metadata will be generated for the table, just as for a regular Delta table:

alter table ahc_demos.streaming.clickstream_gold
set TBLPROPERTIES(
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg');

 

The table can then be queried in Snowflake:
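For example, once the Unity Catalog integration described later in this article (Step 1 of "Read Data in Snowflake") is in place, the gold table can be registered with the same pattern used below for the store_sales table. A sketch, assuming the uc_int_oauth catalog integration created later:

-- register the DLT gold table through the Unity Catalog integration
CREATE ICEBERG TABLE clickstream_gold_iceberg
  CATALOG = uc_int_oauth
  CATALOG_NAMESPACE = 'streaming'
  CATALOG_TABLE_NAME = 'clickstream_gold'
  AUTO_REFRESH = TRUE;

SELECT referrer, click_count FROM clickstream_gold_iceberg LIMIT 10;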

[Screenshot: querying the clickstream_gold table in Snowflake]

Step 2: Set up Unity Catalog for Iceberg REST API

Unity Catalog enables table reads for Iceberg clients by exposing an Iceberg REST API. From the Databricks workspace, we can enable access by performing the following steps:

  1. Enable external data access for your metastore (requires metastore admin permissions). See Enable external data access on the metastore.

[Screenshot: the External data access setting on the metastore]

 

  2. Grant the user or service principal the EXTERNAL USE SCHEMA privilege on the schema containing the tables. See Grant a principal EXTERNAL USE SCHEMA.

Option 1: Grant permission to a user (if using Databricks PAT authentication)

GRANT EXTERNAL USE SCHEMA ON SCHEMA ahc_demos.tpcds_sf_100 TO `a***@databricks.com`

Option 2: Grant permission to a service principal (if using OAuth authentication)

GRANT EXTERNAL USE SCHEMA ON SCHEMA ahc_demos.tpcds_sf_100 TO `<application_id>`

The service principal's application ID can be found under Settings > Identity and access > Service principals in the Databricks workspace.

Read Data in Snowflake

From your Snowflake account, we will query the UniForm tables created above as external Iceberg tables, using Unity Catalog as a remote Iceberg REST catalog:

  1. Create a catalog integration with Unity Catalog using the Iceberg REST API
  2. Create an external volume (only required for UC external tables on Azure and GCP)
  3. Create an external Iceberg table

Step 1: Create a catalog integration using the Iceberg REST catalog

For Snowflake to authenticate with the Unity Catalog metastore through a catalog integration, we can use either OAuth or a PAT created in Databricks (see Authorizing access to Databricks resources): a Databricks PAT if authenticating as an individual user, or OAuth if authenticating as a service principal. In production, OAuth is recommended over PAT.

In your Snowflake account, create or open a SQL worksheet, then run one of the queries below to create the catalog integration:

OAuth option

CREATE OR REPLACE CATALOG INTEGRATION uc_int_oauth
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = 'https://<databricks_workspace_url>/api/2.1/unity-catalog/iceberg'
    WAREHOUSE = 'ahc_demos'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_TOKEN_URI = 'https://<databricks_workspace_url>/oidc/v1/token'
    OAUTH_CLIENT_ID = '69d2fa6d-**********'
    OAUTH_CLIENT_SECRET = 'dose**********'
    OAUTH_ALLOWED_SCOPES = ('all-apis', 'sql')
  )
  ENABLED = TRUE;

  • OAUTH_CLIENT_ID: Service principal client ID from Databricks
  • OAUTH_CLIENT_SECRET: Service principal client secret from Databricks
PAT option

CREATE OR REPLACE CATALOG INTEGRATION uc_int_pat
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'default'
  REST_CONFIG = (
    CATALOG_URI = 'https://<databricks_workspace_url>/api/2.1/unity-catalog/iceberg'
    WAREHOUSE = 'ahc_demos'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = 'dapi****************'
  )
  ENABLED = TRUE;


  • BEARER_TOKEN: the user's Databricks personal access token (PAT)

Note: WAREHOUSE is the name of the catalog in Unity Catalog, NOT the name of a Snowflake warehouse.

After that, run the query below to verify the catalog integration:

SELECT SYSTEM$VERIFY_CATALOG_INTEGRATION('uc_int_pat');

[Screenshot: output of SYSTEM$VERIFY_CATALOG_INTEGRATION]

Note:

When the OAuth secret or PAT expires, the REST_AUTHENTICATION settings of the catalog integration cannot be altered (see ALTER CATALOG INTEGRATION).

You also cannot drop or replace a Snowflake catalog integration while one or more Iceberg tables are associated with it, because those tables depend on the catalog integration for metadata and schema information.

If a token expires, you will therefore have to create a new catalog integration and recreate the Iceberg tables to point to it.

Step 2: Create an external volume

This step is not required if ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS is specified in the catalog integration for UC managed tables (and UC external tables on AWS), but it is necessary for UC external tables on Azure and GCP (see Use catalog-vended credentials for Apache Iceberg tables).

Below is an example using AWS S3:

CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'ahc-demo-s3-west'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://ahc-demo-s3-west/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::*******:role/ahc-role-snowflake'
        STORAGE_AWS_EXTERNAL_ID = 'snowflake_******'
      )
    );

desc external volume iceberg_external_volume;

Verify the external volume connectivity:

SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_external_volume');

Step 3: Create an Iceberg table

CREATE ICEBERG TABLE tpcds_store_sales_iceberg
  CATALOG = uc_int_oauth
  CATALOG_NAMESPACE = 'tpcds_sf_100'
  CATALOG_TABLE_NAME = 'store_sales'
  AUTO_REFRESH = TRUE
  -- uncomment below if you use external UC tables
  -- EXTERNAL_VOLUME = iceberg_external_volume
;

[Screenshot: the Iceberg table created in Snowflake]

  • CATALOG: name of the catalog integration created above
  • CATALOG_NAMESPACE: name of the Unity Catalog schema in Databricks
  • CATALOG_TABLE_NAME: name of the table in Databricks
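Once created, the table can be read like any other Snowflake table. A quick sanity check, assuming the table above:

-- verify data is readable through the Unity Catalog integration
SELECT COUNT(*) FROM tpcds_store_sales_iceberg;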

Create external Iceberg tables with Delta metadata 

(See Create Iceberg table with Delta files in object storage.)

Snowflake can also read Delta tables directly from object storage by using an object store catalog integration. However, this method bypasses Unity Catalog and its Iceberg REST catalog entirely, which breaks Unity Catalog governance.

The following code examples demonstrate how to use Delta metadata to create external Iceberg tables in Snowflake. To do this, we create an Iceberg external table based on an object store catalog integration with DELTA as the table format, authenticating through an external volume.

CREATE OR REPLACE CATALOG INTEGRATION delta_catalog_integration
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = DELTA
  ENABLED = TRUE
  REFRESH_INTERVAL_SECONDS = 86400;

CREATE OR REPLACE EXTERNAL VOLUME iceberg_external_volume
  STORAGE_LOCATIONS =
    (
      (
        NAME = 'ahc-demo-s3-west'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://ahc-demo-s3-west/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::332745928618:role/ahc-role-snowflake'
        STORAGE_AWS_EXTERNAL_ID = 'snowflake_iceberg_external_id'
      )
    );

CREATE ICEBERG TABLE customer
  BASE_LOCATION = 'customer'
  EXTERNAL_VOLUME = 'iceberg_external_volume'
  CATALOG = 'delta_catalog_integration';
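Assuming the REFRESH_INTERVAL_SECONDS setting above controls how often Snowflake polls for new Delta commits, you may still want to pick up changes sooner between polls. A sketch, assuming Snowflake's ALTER ICEBERG TABLE ... REFRESH applies to the table created above:

-- manually refresh the table metadata to pick up new Delta commits
ALTER ICEBERG TABLE customer REFRESH;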

 

Using this option, Snowflake's query plan includes an additional step to process the Delta data files.

[Screenshot: Snowflake query profile showing the additional Delta data file processing step]

Conclusion

These methods highlight the strong interoperability between Delta tables and Snowflake, and show how users can leverage Unity Catalog to govern a single copy of their data while accessing it seamlessly across both platforms.

This capability enables organizations to leverage the strengths of both Delta Lake and Iceberg, optimizing their data architecture for performance, scalability, and cost-efficiency.