Hi everyone,
I have been reviewing the documentation on integrating external Iceberg tables with Databricks. Currently, the only method I have found to read from an Iceberg REST catalog (specifically GCP BigLake in my case) is by explicitly passing the catalog configurations directly to the SparkSession.
Here is the approach I am currently using, which works successfully on standard clusters:
from pyspark.sql import SparkSession

# Catalog and GCP settings
catalog_name = "spark_catalog"
gcp_project_id = "mul-dev-databricks"
warehouse_path = f"bq://projects/{gcp_project_id}/locations/us"
gcp_scopes = "https://www.googleapis.com/auth/cloud-platform"

# Build the SparkSession with the BigLake Iceberg REST catalog configs
spark = (
    SparkSession.builder.appName("BigLake_Iceberg_App")
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog_name}.type", "rest")
    .config(f"spark.sql.catalog.{catalog_name}.uri", "https://biglake.googleapis.com/iceberg/v1/restcatalog")
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", warehouse_path)
    .config(f"spark.sql.catalog.{catalog_name}.header.x-goog-user-project", gcp_project_id)
    .config(f"spark.sql.catalog.{catalog_name}.gcp.auth.scopes", gcp_scopes)
    .config(f"spark.sql.catalog.{catalog_name}.header.X-Iceberg-Access-Delegation", "vended-credentials")
    .config(f"spark.sql.catalog.{catalog_name}.rest.auth.type", "org.apache.iceberg.gcp.auth.GoogleAuthManager")
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")
    .config(f"spark.sql.catalog.{catalog_name}.rest-metrics-reporting-enabled", "false")
    .config("spark.sql.defaultCatalog", catalog_name)
    .getOrCreate()
)
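As a quick sanity check on a standard cluster, I list the namespaces the REST catalog exposes and read a table. The namespace and table names below are just placeholders for whatever actually exists in your BigLake catalog:

# Sanity check: the catalog above is the default, so plain SQL goes straight
# to BigLake. "my_namespace.my_table" is a placeholder name.
spark.sql("SHOW NAMESPACES").show()
spark.sql("SHOW TABLES IN my_namespace").show()

df = spark.read.table("my_namespace.my_table")
df.limit(10).show()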
While I can access the data successfully this way, it is not an ideal long-term solution. It would be much more powerful if Unity Catalog could mount these REST catalogs natively as external catalogs (similar to Unity Catalog Federation).
Relying on Spark configurations introduces a few key challenges:
Serverless Limitations: Serverless compute does not allow most custom Spark configurations, so an approach that depends on cluster- or session-level Spark configs effectively rules out Serverless SQL/Compute for this data.
Governance: Managing connections via code or cluster configs bypasses the centralized governance and access control benefits of Unity Catalog.
User Experience: Every user or job has to duplicate these boilerplate configurations (the small helper sketched after this list is the best mitigation I have found so far).
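On that last point, my current stopgap is just to refactor the builder above into a shared helper so jobs stop copy/pasting the configs. The biglake_spark_session name and module layout are purely my own convention, nothing Databricks-provided, and it does nothing for the serverless or governance problems:

# biglake_catalog.py -- hypothetical shared helper; just the builder from
# above wrapped in a function so jobs don't duplicate the configs.
from pyspark.sql import SparkSession

def biglake_spark_session(app_name: str,
                          project_id: str = "mul-dev-databricks",
                          catalog_name: str = "spark_catalog") -> SparkSession:
    """Return a SparkSession wired to the BigLake Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    configs = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        f"{prefix}.warehouse": f"bq://projects/{project_id}/locations/us",
        f"{prefix}.header.x-goog-user-project": project_id,
        f"{prefix}.gcp.auth.scopes": "https://www.googleapis.com/auth/cloud-platform",
        f"{prefix}.header.X-Iceberg-Access-Delegation": "vended-credentials",
        f"{prefix}.rest.auth.type": "org.apache.iceberg.gcp.auth.GoogleAuthManager",
        f"{prefix}.io-impl": "org.apache.iceberg.gcp.gcs.GCSFileIO",
        f"{prefix}.rest-metrics-reporting-enabled": "false",
        "spark.sql.defaultCatalog": catalog_name,
    }
    builder = SparkSession.builder.appName(app_name)
    for key, value in configs.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

A job then just calls spark = biglake_spark_session("my_job"), but this is still code-level plumbing rather than governed access.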
My questions to the community:
Is there currently a supported way to configure an Iceberg REST Catalog connection directly inside Unity Catalog as a Foreign/External Catalog?
If not, is there an alternative workaround that plays nicely with Databricks Serverless? (The PyIceberg sketch after this list is the closest I have gotten.)
Is native Unity Catalog federation for Iceberg REST catalogs on the Databricks roadmap?
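For context on question 2: the closest serverless-friendly angle I have found is to bypass Spark entirely and talk to the REST catalog with PyIceberg, since it is a plain Python client and needs no Spark configs. This is only a sketch under a couple of assumptions: that pyiceberg and google-auth are installable in the environment, and that BigLake accepts a standard OAuth2 bearer token on the catalog endpoint. It also returns Arrow/pandas data rather than a Spark DataFrame, and it still bypasses Unity Catalog governance entirely:

# Hypothetical serverless-friendly sketch: PyIceberg instead of Spark.
# Assumes pyiceberg and google-auth are available and that BigLake accepts
# a plain OAuth2 bearer token (short-lived; refresh as needed).
import google.auth
from google.auth.transport.requests import Request
from pyiceberg.catalog import load_catalog

gcp_project_id = "mul-dev-databricks"

# Mint a bearer token from Application Default Credentials.
credentials, _ = google.auth.default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())

catalog = load_catalog(
    "biglake",
    **{
        "type": "rest",
        "uri": "https://biglake.googleapis.com/iceberg/v1/restcatalog",
        "warehouse": f"bq://projects/{gcp_project_id}/locations/us",
        "token": credentials.token,
        "header.x-goog-user-project": gcp_project_id,
        "header.X-Iceberg-Access-Delegation": "vended-credentials",
    },
)

print(catalog.list_namespaces())

# "my_namespace.my_table" is a placeholder for a real table in the catalog.
table = catalog.load_table("my_namespace.my_table")
print(table.scan(limit=10).to_pandas())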
Thanks in advance for any insights!