Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading an Iceberg table with AWS Glue Data Catalog as metastore

ideal_knee
New Contributor II

I have created an Iceberg table using AWS Glue; however, whenever I try to read it from a Databricks cluster, I get `java.lang.InstantiationException`. I have tried every combination of Spark configs for my Databricks compute cluster that I can think of based on the Databricks, Dremio, AWS, and Iceberg documentation. Most recently I tried

```
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```

I have also tried including various `spark.sql.catalog.hive_metastore...` configs mentioned in the Iceberg docs, with the same result. Any guidance on the minimal Spark configs necessary (or other suggestions) to allow reading an Iceberg table with AWS Glue Data Catalog as the metastore would be greatly appreciated. Thanks!

5 REPLIES

VZLA
Databricks Employee

To read an Iceberg table using AWS Glue Data Catalog as the metastore on a Databricks cluster, you need to configure Spark with the appropriate settings and ensure compatibility with the Iceberg runtime. Here's an example setup (a consolidated config sketch follows the steps):

  1. Enable AWS Glue Catalog by setting spark.databricks.hive.metastore.glueCatalog.enabled to true.

  2. Include the Iceberg runtime JAR that matches your Spark version. For Spark 3.5, use org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0.

  3. Add Iceberg extensions to Spark with spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions.

  4. Set the following catalog configurations:

    • spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
    • spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
    • spark.sql.catalog.glue.warehouse=s3://<your-warehouse-path>
    • spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
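
Putting steps 1–4 together, the cluster Spark config would look roughly like the sketch below. The catalog name `glue` is arbitrary, and the warehouse path is a placeholder for your own bucket and prefix:

```
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.glue org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse s3://<your-warehouse-path>
spark.sql.catalog.glue.io-impl org.apache.iceberg.aws.s3.S3FileIO
```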

Once the cluster is configured, you can test it by running queries such as SHOW TABLES IN glue.<database_name> or SELECT * FROM glue.<database_name>.<table_name> to validate connectivity.
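
For example, the validation queries above would look like this, with placeholder database and table names (the LIMIT just keeps the test read small):

```
SHOW TABLES IN glue.<database_name>;
SELECT * FROM glue.<database_name>.<table_name> LIMIT 10;
```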

Make sure your Databricks cluster has the necessary AWS credentials and permissions for Glue and S3. This guidance is based on the Databricks documentation for AWS Glue and Iceberg, with specific reference to the Spark configurations needed for Iceberg compatibility. Let us know if you need additional help!

ideal_knee
New Contributor II

Thanks for your quick response. I have added the `iceberg-spark-runtime-3.5_2.12-1.7.0.jar` from iceberg.apache.org as a library on my cluster (the runtime is "16.0 ML (includes Apache Spark 3.5.0, Scala 2.12)") and have the following Spark config:

```
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.glue.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.sql.catalog.glue.type glue
spark.sql.catalog.glue.warehouse s3://<my-bucket>/<my-prefix>/
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.master local[*, 4]
spark.sql.catalog.glue.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.databricks.cluster.profile singleNode
spark.sql.catalog.glue org.apache.iceberg.spark.SparkCatalog
```

When I start the cluster and run `SHOW TABLES IN glue`, no catalog called `glue` is found. Are there additional steps needed to make the new `glue` catalog available? I have also tried applying the above `spark.sql.catalog...` changes using the existing catalog name `hive_metastore`, but that does not work either. Some more guidance on reading Iceberg tables with AWS Glue Data Catalog as the metastore would be appreciated. The page you linked does not mention Iceberg at all. Thanks!

VZLA
Databricks Employee

ideal_knee
New Contributor II

Thank you, though that link also does not mention Iceberg at all.

I am able to see the Iceberg table in Databricks in the `hive_metastore` catalog and to view its schema via `DESCRIBE`, but when I try to actually read the data I get `java.lang.InstantiationException`. I can read other Parquet tables from the `hive_metastore` catalog, which uses AWS Glue Data Catalog as the metastore, but I cannot read the Iceberg table. When I run `SHOW CATALOGS`, I see four catalogs (hive_metastore, main, samples, and system); no catalog named `glue` appears, even with the Spark config I shared previously.

VZLA
Databricks Employee

The details are better explained in the document you were already using: https://iceberg.apache.org/docs/latest/spark-configuration/#replacing-the-session-catalog. The previous URL was shared because I understood the issue to be that the `glue` catalog, which should have been created for Iceberg table support, was not being listed. The second part of the problem is actually reading the data from the Iceberg table, which comes down to having the right JAR file(s) present on the classpath.
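
For reference, the "replacing the session catalog" approach from that page would look roughly like the cluster Spark config below. This is only a sketch: pointing `SparkSessionCatalog` at the Glue `catalog-impl`, and adding the `iceberg-aws-bundle` artifact so the Glue and S3 classes are on the classpath, are assumptions drawn from the Iceberg docs rather than a setup verified on this Databricks runtime.

```
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.spark_catalog.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.sql.catalog.spark_catalog.warehouse s3://<my-bucket>/<my-prefix>/
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0,org.apache.iceberg:iceberg-aws-bundle:1.7.0
```

With the session catalog replaced this way, Iceberg tables registered in Glue would be read through the existing `hive_metastore` / `spark_catalog` name rather than a separate `glue` catalog.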

At this point I believe it would be good to check your setup live through a support ticket. If that's not possible, please let us know and we'll continue this way; I'll check if I can reproduce a similar setup and share the steps.
