dkushari
Databricks Employee

Summary

  • External engines — Apache Spark™ (batch and Structured Streaming), DuckDB, Apache Flink, Starburst, and StreamNative Kafka Service — can now create, read, and write Unity Catalog-managed Delta tables from outside Databricks, with every commit coordinated by Unity Catalog.
  • UC serializes commits across engines, so concurrent external writers can't corrupt the transaction log; every operation is auditable; and Predictive Optimization continues to manage table health (compaction, vacuuming, statistics).
  • Ships with Delta Lake 4.2.0 and Unity Catalog 0.4.1. Grant EXTERNAL_USE_SCHEMA, authenticate with M2M OAuth, and external clients are up and running in minutes — reading from and writing to Unity Catalog managed Delta tables through the Unity Catalog REST API.

The open data lakehouse is built on a simple principle: maintain a single copy of your data, govern it centrally, and access it from the engine best suited to each workload. Previously, we launched UC Open APIs and credential vending, enabling external engines to read tables in Unity Catalog. Now we are extending that capability so external engines can also safely create and write to those tables.

The Beta of External Access to Unity Catalog Managed Tables with Catalog Commits extends Unity Catalog’s open APIs to deliver full read and write interoperability for managed Delta tables from external engines. External engines can now create, read, and write to these tables, while Unity Catalog coordinates every transaction to ensure consistency, governance, and safe multi-engine operation.

Enterprises can bring the best engine to every workload and benefit from price/performance optimizations, without moving or duplicating data. This further establishes Unity Catalog as the industry's most open and interoperable catalog: one copy of data, governed centrally, and accessible from any engine, without requiring separate copies for different platforms or workloads.

In this blog, we walk through how to perform DML operations on Unity Catalog managed Delta tables from external engines using Apache Spark™, DuckDB, and Apache Flink. You can find the complete code examples in this GitHub repository.

Scope note. This Beta is for managed Delta tables only. Managed Iceberg tables already support external writes through Iceberg REST in Public Preview, so both open formats under Unity Catalog now offer full external read, write, and create from outside Databricks.

Scope of this walkthrough. The Beta currently ships connector integrations for five engines: Apache Spark™, DuckDB, Apache Flink, Starburst, and StreamNative Kafka Service. The hands-on examples below cover Apache Spark™, DuckDB, and Apache Flink end to end. Starburst and StreamNative Kafka Service use the same Delta Kernel + catalog-commits substrate, so the integration is identical; see their connector docs for SDK-specific setup.

How Unity Catalog Enables Safe External Writes

The Challenge of Coordinating Writes

Delta tables track every change in an ordered transaction log. When two writers append concurrently without a coordinator, they can pick the same log version, corrupt the log, or silently overwrite each other. Reads alone are safe—they're snapshot-isolated by design—but writes from multiple external engines on the same table were not.

Catalog Commits: A Centralized Commit Coordinator

Catalog commits solve this by making Unity Catalog the commit coordinator. Every write—regardless of which engine initiated it—flows through Unity Catalog, which serializes commits, enforces access controls, and records each operation for audit. Because Unity Catalog sees every external operation, Predictive Optimization continues to maintain table health even when tables are written entirely by engines outside Databricks. Catalog commits also lay the foundation for capabilities that require centralized coordination, such as multi-table transactions.
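To make the coordination model concrete, here is a minimal, purely illustrative Python sketch of the contract a commit coordinator enforces. It is not the Unity Catalog implementation or its API; it only shows why serializing commits per table version keeps concurrent external writers from corrupting the log.

import threading

class CatalogCommitCoordinator:
    """Illustrative stand-in for a catalog acting as the commit coordinator.

    Not the Unity Catalog implementation or API: just a sketch of the
    optimistic-concurrency contract. At most one commit is accepted per
    table version; a writer whose base version is stale must rebase and retry.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}   # table name -> latest committed version
        self._audit = []    # (table, version, engine) for every accepted commit

    def latest_version(self, table):
        with self._lock:
            return self._latest.get(table, -1)

    def commit(self, table, base_version, engine):
        # Serialize commits: only the writer whose base version is still
        # current may advance the log; everyone else sees a conflict.
        with self._lock:
            current = self._latest.get(table, -1)
            if base_version != current:
                return False
            self._latest[table] = current + 1
            self._audit.append((table, current + 1, engine))
            return True

coordinator = CatalogCommitCoordinator()

def external_writer(engine, table="tpch_managed.orders"):
    # Optimistic write loop: stage data files, then ask the coordinator to
    # commit against the version the write was based on; retry on conflict.
    while True:
        base = coordinator.latest_version(table)
        # ... stage Parquet data files for this commit ...
        if coordinator.commit(table, base, engine):
            return

threads = [threading.Thread(target=external_writer, args=(engine,))
           for engine in ("duckdb", "external-spark", "flink")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(coordinator._audit)   # three accepted commits, versions 0..2, one per engine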

Delta Kernel: The Foundation for Open Connectivity

The open source library that makes this possible for connector authors is Delta Kernel—a set of Java and Rust APIs for reading, writing, and committing to Delta tables. Delta Kernel abstracts the low-level Delta protocol, so new connectors only need to integrate it into their engine's I/O layer. The Apache Spark™ integration (unitycatalog-spark), the Apache Flink connector, StreamNative, and the DuckDB integration (the unity_catalog core extension) are all built on Delta Kernel.

Authentication: M2M OAuth

External clients authenticate to Databricks using machine-to-machine (M2M) OAuth with a service principal. Personal Access Tokens (PATs) are also supported, but they're legacy, per-user, long-lived, and difficult to rotate. 

The Databricks SDK fetches a short-lived access token using the service principal's client credentials:

from databricks.sdk.core import Config, oauth_service_principal

def get_oauth_token() -> str:
    config = Config(
        host=DATABRICKS_HOST,
        client_id=DATABRICKS_CLIENT_ID,
        client_secret=DATABRICKS_CLIENT_SECRET,
    )
    token_source = oauth_service_principal(config)
    return token_source().access_token

For the Spark connector, you don't need to mint a token yourself — the connector accepts the service principal's client credentials and refreshes its own tokens against /oidc/v1/token whenever they expire. This is the production path for long-running pipelines (especially Structured Streaming).
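For reference, the exchange behind both paths is a standard OAuth 2.0 client-credentials flow against the workspace's /oidc/v1/token endpoint with the all-apis scope. A small sketch using the requests library (get_oauth_token_raw is a hypothetical helper name; DATABRICKS_HOST, DATABRICKS_CLIENT_ID, and DATABRICKS_CLIENT_SECRET are the same variables used above):

import requests

def get_oauth_token_raw() -> str:
    # Same exchange the SDK and the Spark connector perform under the hood:
    # POST the service principal's client credentials to /oidc/v1/token and
    # receive a short-lived bearer token.
    resp = requests.post(
        f"{DATABRICKS_HOST}/oidc/v1/token",
        auth=(DATABRICKS_CLIENT_ID, DATABRICKS_CLIENT_SECRET),
        data={"grant_type": "client_credentials", "scope": "all-apis"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["access_token"]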

Setting up the demo catalog

From a Databricks workspace, create a catalog and schema. For this blog, we have cloned eight samples.tpch tables as UC-managed Delta tables into the schema, and granted EXTERNAL_USE_SCHEMA on the schema to the Service Principal that the external engines will authenticate as. Full DDL is in 00_setup_databricks.sql; the essentials:

DROP CATALOG IF EXISTS uc_ext_access_demo CASCADE;
CREATE CATALOG uc_ext_access_demo;
CREATE SCHEMA  uc_ext_access_demo.tpch_managed;
USE CATALOG uc_ext_access_demo;
USE SCHEMA  tpch_managed;
CREATE OR REPLACE TABLE customer TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported') AS SELECT * FROM samples.tpch.customer;
CREATE OR REPLACE TABLE orders TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported') AS SELECT * FROM samples.tpch.orders;
-- repeat for lineitem, nation, part, partsupp, region, supplier
GRANT USE CATALOG          ON CATALOG uc_ext_access_demo                   TO `<SP_CLIENT_ID>`;
GRANT USE SCHEMA, SELECT, MODIFY, CREATE TABLE, EXTERNAL_USE_SCHEMA  ON SCHEMA  uc_ext_access_demo.tpch_managed TO `<SP_CLIENT_ID>`;

The DROP CATALOG IF EXISTS … CASCADE at the start makes this idempotent — every run wipes prior state and reseeds, so the demo always starts from a known baseline.

DuckDB — READ and INSERT

DuckDB connects to Unity Catalog via two core extensions — unity_catalog (Databricks UC integration) and delta (Kernel-based Delta reader/writer). Both are installed and loaded at runtime; a single anonymous UNITY_CATALOG secret then carries the OAuth token, and an ATTACH statement mounts the catalog into DuckDB's namespace:

con = duckdb.connect()
con.execute("INSTALL unity_catalog; LOAD unity_catalog;")
con.execute("INSTALL delta; LOAD delta;")
con.execute(f"""
    CREATE SECRET (
        TYPE UNITY_CATALOG,
        TOKEN     '{get_oauth_token()}',
        ENDPOINT  '{DATABRICKS_HOST}/',
        AWS_REGION '{AWS_REGION}'
    );
""")

con.execute("""
    ATTACH 'uc_ext_access_demo' AS uc_ext_access_demo
        (TYPE UNITY_CATALOG);
""")

Two details that the official `unity_catalog` extension docs emphasize: the ENDPOINT is the workspace root URL (not <host>/api/2.1/unity-catalog), and the SECRET is created without a name, so the extension auto-resolves it by type when ATTACH runs without an explicit SECRET reference.
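Before querying, you can sanity-check the attachment. A small sketch, assuming the con connection created above:

# Confirm the UC catalog is attached and its tables are visible to DuckDB.
print(con.execute("SHOW DATABASES").fetchdf())     # should include uc_ext_access_demo
print(con.execute("SHOW ALL TABLES").fetchdf())    # lists schemas/tables across attached catalogs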

The cloned TPCH tables behave like native DuckDB tables. Read, scan, JOIN — all the standard DuckDB SQL surface works against managed Delta tables:

con.execute("""
    SELECT n.n_name, count(*) AS customer_count
    FROM uc_ext_access_demo.tpch_managed.customer c
    JOIN uc_ext_access_demo.tpch_managed.nation   n
      ON c.c_nationkey = n.n_nationkey
    GROUP BY n.n_name
    ORDER BY customer_count DESC
    LIMIT 5
""").fetchdf()

And with INSERT INTO support, DuckDB can now append rows directly to existing Unity Catalog managed Delta tables. UC serializes the commit, the Delta extension runs the actual write through the kernel, and Spark / Databricks readers see the new rows immediately:

orders = "uc_ext_access_demo.tpch_managed.orders"

# Multi-row VALUES — three rows in one statement
con.execute(f"""
    INSERT INTO {orders} (
        o_orderkey, o_custkey, o_orderstatus, o_totalprice,
        o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment
    ) VALUES
        (9000000301, 1, 'O', 100.00, DATE '2026-04-23', '3-MEDIUM',
         'Clerk#external-duckdb', 0, 'duckdb single-row insert'),
        (9000000302, 2, 'O', 200.00, DATE '2026-04-23', '3-MEDIUM',
         'Clerk#external-duckdb', 0, 'duckdb single-row insert'),
        (9000000303, 3, 'O', 300.00, DATE '2026-04-23', '5-LOW',
         'Clerk#external-duckdb', 0, 'duckdb single-row insert');
""")

# Bulk insert from another UC table
con.execute(f"""
    INSERT INTO {orders}
    SELECT
        o_orderkey + 9_000_000_400 AS o_orderkey,
        o_custkey, o_orderstatus, o_totalprice, o_orderdate,
        o_orderpriority,
        'Clerk#external-duckdb-bulk' AS o_clerk,
        o_shippriority,
        'duckdb bulk insert' AS o_comment
    FROM {orders}
    WHERE o_orderkey BETWEEN 11396166 AND 11396175
""")

The DuckDB unity_catalog extension is currently focused on the read and INSERT path against Databricks UC; the extension docs are the source of truth for what's supported as the integration evolves.

Apache Spark™ — batch read

The Spark connector for Unity Catalog (unitycatalog-spark) is the reference implementation for catalog commits. Configure a Spark session outside Databricks, point it at your workspace, and let the connector handle the OAuth flow:

spark = (
    SparkSession.builder.appName("uc-external-access-demo")
    .config("spark.jars.packages",
            "io.delta:delta-spark_2.13:4.2.0,"
            "io.unitycatalog:unitycatalog-spark_2.13:0.4.1,"
            "org.apache.hadoop:hadoop-aws:3.4.2")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    # Register UC as a Spark catalog with M2M OAuth
    .config(f"spark.sql.catalog.{UC_CATALOG}",
            "io.unitycatalog.spark.UCSingleCatalog")
    .config(f"spark.sql.catalog.{UC_CATALOG}.uri", DATABRICKS_HOST)
    .config(f"spark.sql.catalog.{UC_CATALOG}.auth.type", "oauth")
    .config(f"spark.sql.catalog.{UC_CATALOG}.auth.oauth.uri",
            f"{DATABRICKS_HOST}/oidc/v1/token")
    .config(f"spark.sql.catalog.{UC_CATALOG}.auth.oauth.clientId",
            DATABRICKS_CLIENT_ID)
    .config(f"spark.sql.catalog.{UC_CATALOG}.auth.oauth.clientSecret",
            DATABRICKS_CLIENT_SECRET)
    .config(f"spark.sql.catalog.{UC_CATALOG}.renewCredential.enabled", "true")
    .config("spark.sql.defaultCatalog", UC_CATALOG)
    # UC returns s3:// URIs; map them to Hadoop's S3A FileSystem
    .config("spark.hadoop.fs.s3.impl",
            "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.AbstractFileSystem.s3.impl",
            "org.apache.hadoop.fs.s3a.S3A")
).getOrCreate()

A few config lines are worth calling out: auth.type=oauth tells the connector to mint and refresh tokens itself, so a long-running session doesn't expire mid-job, and spark.hadoop.fs.s3.impl=S3AFileSystem bridges UC's s3:// URIs to Hadoop's S3A driver.

Once that's established, reading a managed Delta table from the external Spark process looks identical to reading any other Spark table — every operation goes through UC behind the scenes:

spark.table("uc_ext_access_demo.tpch_managed.orders").show(5)
spark.sql(
    "DESCRIBE HISTORY uc_ext_access_demo.tpch_managed.orders"
).show(truncate=False)

The DESCRIBE HISTORY output will show engineInfo = Apache-Spark/4.1.1 Delta-Lake/4.2.0 for any commit produced by this external session — alongside the original Databricks-Runtime/... commit that seeded the table using a Databricks SQL Warehouse.
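Because every external commit is an ordinary Delta version, time travel also works from the external session (the repo's 03_spark_external_read.py does a similar comparison). A short sketch along those lines:

# Compare the current row count against the table as of version 0 (the seed CTAS).
current = spark.table("uc_ext_access_demo.tpch_managed.orders").count()
seeded = spark.sql("""
    SELECT count(*) AS cnt
    FROM uc_ext_access_demo.tpch_managed.orders VERSION AS OF 0
""").collect()[0]["cnt"]
print(f"rows now: {current}, rows at version 0: {seeded}")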

Apache Spark™ — batch write

The same external session can append rows or create a brand-new managed Delta table with CREATE TABLE AS SELECT:

# new_df is a DataFrame whose schema matches the orders table
new_df.write.mode("append").saveAsTable(
    "uc_ext_access_demo.tpch_managed.orders"
)
spark.sql("""
    CREATE TABLE uc_ext_access_demo.tpch_managed.orders_summary
    USING DELTA
    TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported')
    AS
    SELECT
      o_orderstatus,
      o_orderpriority,
      count(*)             AS order_count,
      sum(o_totalprice)    AS total_value,
      avg(o_totalprice)    AS avg_value
    FROM uc_ext_access_demo.tpch_managed.orders
    GROUP BY o_orderstatus, o_orderpriority
""")

Two things worth noting:

  • `USING DELTA TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported')` is required when an external engine creates a managed Delta table for the first time. This table-level feature flag enables catalog-managed commits; without it, UC rejects the request to create the managed table (illustrated in the sketch after this list).
  • The `saveAsTable` DataFrame writer just works for appends to existing managed tables; no DDL feature flag is needed at append time because the property is already set on the table.
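To see the first bullet in action, a hedged sketch: per the note above, Unity Catalog is expected to reject an external CTAS that omits the property. The table name is hypothetical, and the exact error type and message surfaced by the connector may vary, so the sketch catches broadly:

# Attempting an external CTAS without the catalogManaged feature flag.
# Per the note above, Unity Catalog rejects this create; the exception
# surfaced by the connector may vary, so we catch broadly for illustration.
try:
    spark.sql("""
        CREATE TABLE uc_ext_access_demo.tpch_managed.orders_summary_no_flag
        USING DELTA
        AS SELECT o_orderstatus, count(*) AS order_count
        FROM uc_ext_access_demo.tpch_managed.orders
        GROUP BY o_orderstatus
    """)
except Exception as e:
    print(f"CREATE rejected as expected: {e}")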

Once both writes are complete, DESCRIBE HISTORY shows the original Databricks-Runtime CTAS sitting alongside the new Apache-Spark/4.1.1 Delta-Lake/4.2.0 commits — same managed table, multiple engine writes with catalog commits.

Apache Spark™ — Structured Streaming

Catalog-managed commits also cover Structured Streaming writes. Each micro-batch becomes a single UC commit, attributed to the external Spark engine. The pattern below uses Spark's built-in rate source, so the example is self-contained — no external broker required — but the exact same code shape works for Kafka, Kinesis, files, anything Spark can stream from.

spark.sql("""
    CREATE TABLE uc_ext_access_demo.tpch_managed.orders_stream (
        o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus STRING,
        o_totalprice DECIMAL(18,2), o_orderdate DATE,
        o_orderpriority STRING, o_clerk STRING,
        o_shippriority INT, o_comment STRING
    )
    USING DELTA
    TBLPROPERTIES ('delta.feature.catalogManaged' = 'supported')
""")

from pyspark.sql.functions import col, concat, current_date, expr, lit

src = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 50)
    .load()
)
shaped = (
    src
    .withColumn("o_orderkey",   (lit(9_100_000_000) + col("value")).cast("bigint"))
    .withColumn("o_custkey",    ((col("value") % lit(1500)) + lit(1)).cast("bigint"))
    .withColumn("o_orderstatus", lit("O"))
    .withColumn("o_totalprice",  expr("cast(rand() * 1000 as decimal(18,2))"))
    .withColumn("o_orderdate",   current_date())
    .withColumn("o_orderpriority", lit("3-MEDIUM"))
    .withColumn("o_clerk",       lit("Clerk#external-spark-stream"))
    .withColumn("o_shippriority", lit(0))
    .withColumn("o_comment",
     concat(lit("streamed by external Spark @ batch_value="),
     col("value").cast("string")))
    .select("o_orderkey", "o_custkey", "o_orderstatus", "o_totalprice",
     "o_orderdate", "o_orderpriority", "o_clerk",
     "o_shippriority", "o_comment")
)

query = (
    shaped.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/orders_stream_ckpt")
    .trigger(processingTime="5 seconds")
    .toTable("uc_ext_access_demo.tpch_managed.orders_stream")
)

query.awaitTermination(30)   # let the stream run for ~30 seconds
query.stop()                 # then stop it so the demo exits cleanly

After the stream stops, DESCRIBE HISTORY on orders_stream shows the CREATE TABLE commit (version 0) followed by a STREAMING UPDATE for each micro-batch, each tagged Apache-Spark/4.1.1 Delta-Lake/4.2.0. The checkpoint is a local directory; UC only coordinates the catalog commit itself, not the engine's micro-batch state.
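If you want to see that attribution programmatically, a quick sketch from the same session (version, operation, and engineInfo are standard DESCRIBE HISTORY columns):

# Each micro-batch appears as its own STREAMING UPDATE version in the history.
(spark.sql("DESCRIBE HISTORY uc_ext_access_demo.tpch_managed.orders_stream")
     .select("version", "operation", "engineInfo")
     .orderBy("version")
     .show(truncate=False))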

Apache Flink — Streaming Sink to a UC Managed Delta Table

Delta Lake 4.2 ships a brand-new Flink connector built on Delta Kernel. The legacy Flink connector, deprecated in Delta 4.0 alongside Delta Standalone, is gone, and the new connector supports Unity Catalog managed Delta tables from inception. It is sink-only and experimental today, with exactly-once semantics backed by a Flink Sink Writer and a single global Committer: every Flink checkpoint becomes one UC commit on the target table. The connector accepts a bearer credential through the Flink catalog's token field. Personal Access Tokens (PATs) work here, and so do machine-to-machine (M2M) OAuth bearer tokens; this demo passes the same M2M OAuth bearer token used in the DuckDB and Spark sections (PATs are considered legacy at Databricks).

The pattern below pairs a bounded datagen source (so the job terminates cleanly without an explicit STOP JOB) with a UC-managed Delta sink created up front with delta.feature.catalogManaged='supported'. Each Flink checkpoint produces a real Delta version coordinated by UC.

SET 'table.dml-sync' = 'true';
-- Register Unity Catalog as a Flink catalog. The Flink catalog name must
-- match the UC catalog name so 3-part identifiers resolve straight through.
CREATE CATALOG `uc_ext_access_demo` WITH (
  'type' = 'unitycatalog',
  'endpoint' = '<DATABRICKS_HOST>/',
  'token' = '<BEARER_TOKEN>'
);

CREATE TEMPORARY TABLE flink_orders_flink_src (
  seq BIGINT,
  payload STRING
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '100',
  'fields.seq.kind' = 'sequence',
  'fields.seq.start' = '1',
  'fields.seq.end' = '3000',
  'fields.payload.length' = '24'
);

INSERT INTO `uc_ext_access_demo`.`tpch_managed`.`orders_flink`
SELECT
  CAST(9200000000 + seq AS BIGINT)                 AS o_orderkey,
  CAST((seq % 1500) + 1 AS BIGINT)                 AS o_custkey,
  CAST('O' AS STRING)                              AS o_orderstatus,
  CAST(MOD(seq * 7, 1000) AS DECIMAL(18,2))        AS o_totalprice,
  CURRENT_DATE                                     AS o_orderdate,
  CAST('3-MEDIUM' AS STRING)                       AS o_orderpriority,
  CAST('Clerk#external-flink-stream' AS STRING)    AS o_clerk,
  CAST(0 AS INT)                                   AS o_shippriority,
  CONCAT('streamed by external Flink @ seq=',
  CAST(seq AS STRING))                      AS o_comment
FROM flink_orders_flink_src;
-- ---------------------------------------------------------------------------
-- Demo 2 — small INSERT into the shared `orders` table (also written by
--          Spark + DuckDB). After this, DESCRIBE HISTORY orders shows
--          commits from FOUR engines — Databricks Runtime, external
--          Apache Spark, DuckDB, and Apache Flink — all coordinated by UC.
-- ---------------------------------------------------------------------------

CREATE TEMPORARY TABLE flink_orders_shared_src (
  seq BIGINT
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '10',
  'fields.seq.kind' = 'sequence',
  'fields.seq.start' = '1',
  'fields.seq.end' = '10'
);

INSERT INTO `uc_ext_access_demo`.`tpch_managed`.`orders`
SELECT
  CAST(9300000000 + seq AS BIGINT)                 AS o_orderkey,
  CAST((seq % 1500) + 1 AS BIGINT)                 AS o_custkey,
  CAST('O' AS STRING)                              AS o_orderstatus,
  CAST(MOD(seq * 11, 1000) AS DECIMAL(18,2))       AS o_totalprice,
  CURRENT_DATE                                     AS o_orderdate,
  CAST('3-MEDIUM' AS STRING)                       AS o_orderpriority,
  CAST('Clerk#external-flink-stream' AS STRING)    AS o_clerk,
  CAST(0 AS INT)                                   AS o_shippriority,
  CONCAT('inserted into shared orders by external Flink @ seq=',
  CAST(seq AS STRING))                      AS o_comment
FROM flink_orders_shared_src;

The connector ships on Maven Central as io.delta:delta-flink (4.2.0 at the time of writing — https://central.sonatype.com/artifact/io.delta/delta-flink). It is a thin JAR — its transitive deps (delta-kernel-*, hadoop-aws, AWS SDK bundle, parquet, ...) must also be on the Flink classpath. The companion helper setup_flink_usrlib.sh wraps mvn dependency:copy-dependencies and stages the full set under flink/docker/<ver>/usrlib/ in one shot; building from source via sbt flink/assembly is documented as an alternative for master features. Setting 'table.dml-sync' to 'true' causes the Flink SQL client to block on the streaming INSERT, so the run terminates cleanly when the bounded datagen source completes.

After the run, DESCRIBE HISTORY uc_ext_access_demo.tpch_managed.orders_flink shows version 0 (the CREATE TABLE) followed by the Flink-attributed commit(s) — proof that Unity Catalog coordinated the external Flink write through catalog commits.

Auditing Cross-Engine Writes with DESCRIBE HISTORY

The end-to-end demo seeds orders with a Databricks-Runtime CTAS, then has DuckDB, external Apache Spark™, and Apache Flink take turns writing to it. After that, a single DESCRIBE HISTORY orders from any of them shows the full operation log — the engineInfo column tells you exactly which engine produced each commit.

[Image: DESCRIBE HISTORY output for the orders table, with engineInfo attributing commits to four different engines]

Multiple commits, four engines, one log. Unity Catalog was the coordinator for every entry — that's the cross-engine guarantee. One small note on the table above:

  • The Databricks-Runtime/... row at version 0 came from a serverless DBSQL warehouse running the setup SQL. Anything written inside Databricks identifies itself with that Databricks runtime string in engineInfo; anything written from an external engine is identified by the engine + connector versions (here, Apache-Spark/4.1.1 Delta-Lake/4.2.0, DuckDB, and Kernel-<ver>/DeltaSink for Flink).

The queries below tag every row in the orders table with the engine that wrote it, using the o_clerk column to identify each writer engine. The first query returns a per-engine count, and the second one returns a small sample of rows to show the actual records inserted by each engine.

Query 1:

[Image: Query 1 output showing per-engine row counts in the orders table]

Query 2:

[Images: Query 2 output showing sample rows written by each engine]
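
For reference, the two queries look roughly like this when run from the external Spark session. This is a sketch using the o_clerk marker values each external writer stamped in the sections above; the repo's verification script may differ in detail:

orders = "uc_ext_access_demo.tpch_managed.orders"

# Query 1 — per-writer row counts for the externally inserted rows
# (each external writer stamped a distinct Clerk#external-* marker).
spark.sql(f"""
    SELECT o_clerk AS writer_marker, count(*) AS row_count
    FROM {orders}
    WHERE o_clerk LIKE 'Clerk#external-%'
    GROUP BY o_clerk
    ORDER BY writer_marker
""").show(truncate=False)

# Query 2 — a small sample of the actual rows each external engine wrote.
spark.sql(f"""
    SELECT o_orderkey, o_orderdate, o_clerk, o_comment
    FROM {orders}
    WHERE o_clerk LIKE 'Clerk#external-%'
    ORDER BY o_clerk, o_orderkey
    LIMIT 10
""").show(truncate=False)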

Ecosystem status and direction

The ecosystem continues to evolve:

  • Delta Kernel remains the common building block. New connectors only need to bridge the kernel into their engine’s I/O layer.
  • Starburst and StreamNative Kafka Service also ship catalog-commit integrations. These engines follow the same interoperability pattern, with Delta Kernel handling the core abstraction and Unity Catalog acting as the commit coordinator.
  • Attribute-based access control (ABAC) for external reads is the next governance frontier. The goal is to extend Unity Catalog’s row- and column-level policies to queries executed by external engines against managed tables.

Conclusion

With this Beta, the open lakehouse story is finally symmetric. Store your data once as a UC-managed Delta table, govern it centrally, and let every engine — Apache Spark™ in batch, Spark Structured Streaming, DuckDB, or whatever Delta Kernel-backed engine comes next — perform DML operations against the same tables. Catalog commits make it safe; M2M OAuth and EXTERNAL_USE_SCHEMA make it governed; Delta Kernel makes it easy to extend.

Clone the GitHub repository and try performing DML operations from Spark, DuckDB, or Flink today. Then look at DESCRIBE HISTORY on a table that has been written by Databricks Runtime, DuckDB, external Spark, and Apache Flink: a single, interleaved transaction history in which the engineInfo column attributes each commit to the engine that produced it, demonstrating cross-engine write coordination under Unity Catalog.

Appendix

Code and execution flow

A full, runnable demo is available in this GitHub repository.

 

  • 00_setup_databricks.sql: DROP + CREATE catalog + schema, clone 8 TPCH tables as managed Delta, grant EXTERNAL_USE_SCHEMA.
  • run_setup.py: Driver: substitutes the SP application_id, splits + submits the SQL via the Statement Execution API.
  • 01_duckdb_read.py: DuckDB — list / SELECT / cross-table JOIN against UC managed Delta tables.
  • 02_duckdb_insert.py: DuckDB — INSERT INTO ... VALUES and INSERT INTO ... SELECT into a managed Delta table.
  • 03_spark_external_read.py: External Apache Spark™ — batch read + DESCRIBE HISTORY + time travel via VERSION AS OF.
  • 04_spark_external_write.py: External Apache Spark™ — append + CREATE TABLE AS SELECT.
  • 05_spark_streaming.py: External Apache Spark™ Structured Streaming — readStream (rate source) → writeStream.toTable() into a managed Delta table; ends with a time-travel comparison.
  • 06_flink_streaming.py: External Apache Flink — drop+create orders_flink, DELETE prior Flink-marker rows from orders, render Flink SQL using the catalog-based form (CREATE CATALOG ... WITH 'type'='unitycatalog' + INSERT INTO <catalog>.<schema>.<table>) with the OAuth bearer token substituted, then auto-detect a running Flink JobManager Docker container and exec bin/sql-client.sh -f against it. The new Kernel-based Delta Flink Connector ships on Maven Central as io.delta:delta-flink:4.2.0 — sink-only, experimental.
  • 07_verify_cross_engine.py: External Spark — cross-engine DESCRIBE HISTORY showing the mix of engineInfo values across the catalog.
  • run_all.py: One-shot pipeline runner — 9 steps: setup → duckdb read → duckdb insert → spark read → spark write → streaming → flink → verify → cleanup. Idempotent end-to-end. CLI flags: -a (all), -s (spark), -f (flink), -d (duckdb), --no-setup, --no-cleanup; combinable. At least one engine flag is required (no implicit run-all).
  • 99_cleanup.sql: DROP SCHEMA + CATALOG IF EXISTS … CASCADE. Driven by run_cleanup.py at the end of run_all.py, or runnable standalone via the Databricks SQL editor.
  • run_cleanup.py: Driver: executes 99_cleanup.sql via the Statement Execution API. Runs as the final step of run_all.py, so an end-to-end run leaves the workspace in its pre-demo state.
  • setup_flink_usrlib.sh: One-shot Maven helper — runs mvn dependency:copy-dependencies against io.delta:delta-flink:4.2.0 and stages the connector + transitive deps (delta-kernel-*, hadoop-aws, AWS SDK bundle, parquet, ...) into a target Flink usrlib/ directory. Replaces the manual sbt flink/assembly path for blog readers.
  • _common.py: Shared OAuth, Spark builder, and DuckDB attach helpers.
  • requirements.txt: Pinned Python package versions.
  • .env.example: Template for workspace host + service principal credentials.
  • README.md: End-to-end run instructions.

Prerequisites to run this demo

Here are the prerequisites to run this demo in your own workspace:

  • A Databricks workspace with Unity Catalog enabled.
  • Workspace enrolled in the External Access to Unity Catalog Managed Delta Table preview (Settings → Previews).
  • External Data Access enabled on the metastore (Governance → Metastore details).
  • A service principal with an OAuth client_id + client_secret (Workspace admin → Identity and access → Service principals → OAuth secrets).
  • EXTERNAL_USE_SCHEMA, MODIFY, SELECT (and CREATE TABLE for the CTAS examples) granted on the schema to the principal accessing data externally; see 00_setup_databricks.sql for the full grant statements.
  • A serverless SQL warehouse (used by the setup script to run the catalog/schema/grants DDL).
  • Local machine with Python 3.11+ and Java 17+.
  • Apache Flink demo prerequisites:
      • Java 17+, Maven (brew install maven), and Docker Desktop installed locally.
      • Stage the Delta Flink Connector + transitive deps from Maven Central into your Flink usrlib/ using the companion helper scripts/setup_flink_usrlib.sh — it runs mvn dependency:copy-dependencies against io.delta:delta-flink:4.2.0 (https://central.sonatype.com/artifact/io.delta/delta-flink) and copies the full set of JARs (delta-kernel-*, hadoop-aws, AWS SDK bundle, parquet, ...) into the target directory. Building from source via sbt flink/assembly is documented in the upstream README (https://github.com/delta-io/delta/tree/master/flink) as an alternative for master features.
      • Clone delta-io/delta locally to get the docker-compose recipe under flink/docker/<flink-version>/, then run setup_flink_usrlib.sh against that usrlib/ directory.
      • Start the local Flink cluster with docker compose up -d from flink/docker/<flink-version>/. The 06_flink_streaming.py script auto-detects the running JobManager container, copies the rendered SQL into it, and runs bin/sql-client.sh -f.

Building and running the Flink demo locally

Apache Flink is the only engine in this post that requires a one-time cluster setup beyond the Python venv. The steps below assume you have already completed the workspace setup described in the prerequisites above. You will also need Java 17+, Maven, and Docker installed locally.

  1. Install prereqs (one-time). On macOS:
brew install maven

# Java 17+ should already be installed; otherwise: 

brew install openjdk@17

# Docker Desktop: https://www.docker.com/products/docker-desktop/

  2. Clone delta-io/delta locally to get the docker-compose recipe.
git clone https://github.com/delta-io/delta ~/delta
  3. Stage the Delta Flink Connector + transitive deps from Maven Central into the Flink docker-compose usrlib/. The companion helper resolves io.delta:delta-flink:4.2.0 and its transitive deps (delta-kernel-*, hadoop-aws, AWS SDK bundle, parquet, ...) and copies them in one shot.
bash <path-to-blog>/scripts/setup_flink_usrlib.sh ~/delta/flink/docker/2.0/usrlib
  4. Start the local Flink cluster.
cd ~/delta/flink/docker/2.0
docker compose up -d
docker compose ps
  5. Run the Flink demo. The script renders the Flink SQL with your live OAuth bearer token, drops+creates orders_flink on Databricks, deletes prior Flink-marker rows from orders, then docker-execs the Flink SQL client against the rendered file.
python 06_flink_streaming.py
  6. Verify and tear down. The cross-engine verify shows DESCRIBE HISTORY orders with all four engineInfo values, and DESCRIBE HISTORY orders_flink with the Flink-attributed bulk commit.
python 07_verify_cross_engine.py
docker compose down                 # stop the local Flink cluster

For master builds (e.g., testing pre-release Flink connector features), the upstream sbt flink/assembly path still works — see https://github.com/delta-io/delta/tree/master/flink — but for the blog walkthrough, the Maven Central recipe above is the recommended route.

References