Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Reading an Iceberg table with AWS Glue Data Catalog as metastore

ideal_knee
New Contributor

I have created an Iceberg table using AWS Glue; however, whenever I try to read it using a Databricks cluster, I get `java.lang.InstantiationException`. I have tried every combination of Spark configs for my Databricks compute cluster that I can think of based on Databricks, Dremio, AWS, and Iceberg documentation. Most recently I tried

```
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```

I have also tried including various `spark.sql.catalog.hive_metastore...` configs mentioned in the Iceberg docs, with the same result. Any guidance on the minimal Spark configs necessary (or other suggestions) to allow reading an Iceberg table with AWS Glue Data Catalog as the metastore would be greatly appreciated. Thanks!

5 REPLIES

VZLA
Databricks Employee

To read an Iceberg table using AWS Glue Data Catalog as the metastore on a Databricks cluster, you need to configure Spark with the appropriate settings and ensure compatibility with the Iceberg runtime. Here's an example setup (a consolidated config follows the numbered steps):

  1. Enable AWS Glue Catalog by setting spark.databricks.hive.metastore.glueCatalog.enabled to true.

  2. Include the Iceberg runtime JAR that matches your Spark version. For Spark 3.5, use org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0.

  3. Add Iceberg extensions to Spark with spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions.

  4. Set the following catalog configurations:

    • spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog
    • spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
    • spark.sql.catalog.glue.warehouse=s3://<your-warehouse-path>
    • spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO
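
Putting steps 1-4 together, the full cluster Spark config would look roughly like this (the catalog name `glue` and the warehouse path are placeholders to adapt to your environment):

```
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.glue org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue.warehouse s3://<your-warehouse-path>
spark.sql.catalog.glue.io-impl org.apache.iceberg.aws.s3.S3FileIO
```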

Once the cluster is configured, you can test it by running queries such as SHOW TABLES IN glue.<database_name> or SELECT * FROM glue.<database_name>.<table_name> to validate connectivity.
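
For instance, once the cluster starts, something like the following (with your own database and table names) should confirm the catalog is wired up:

```
SHOW TABLES IN glue.<database_name>;
SELECT * FROM glue.<database_name>.<table_name> LIMIT 10;
```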

Make sure your Databricks cluster has the necessary AWS credentials and permissions for Glue and S3. This guidance is based on the Databricks documentation for AWS Glue and Iceberg, specifically the Spark configuration options for Iceberg compatibility. Let us know if you need additional help!
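
As an illustrative sketch only (the exact actions and resource ARNs depend on your account, and the bucket name is a placeholder), the instance profile attached to the cluster typically needs read access to the Glue catalog and the warehouse bucket, along these lines:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueCatalogRead",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetPartitions"
      ],
      "Resource": "*"
    },
    {
      "Sid": "WarehouseRead",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::<your-warehouse-bucket>",
        "arn:aws:s3:::<your-warehouse-bucket>/*"
      ]
    }
  ]
}
```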

ideal_knee
New Contributor

Thanks for your quick response. I have added the `iceberg-spark-runtime-3.5_2.12-1.7.0.jar` from iceberg.apache.org as a library on my cluster (the runtime is "16.0 ML (includes Apache Spark 3.5.0, Scala 2.12)") and have the following Spark config:

 

```
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.glue.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.sql.catalog.glue.type glue
spark.sql.catalog.glue.warehouse s3://<my-bucket>/<my-prefix>/
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.master local[*, 4]
spark.sql.catalog.glue.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.databricks.cluster.profile singleNode
spark.sql.catalog.glue org.apache.iceberg.spark.SparkCatalog
```

 

 When I run the cluster and try `SHOW TABLES IN glue`, it doesn't find a catalog called `glue`. Are there additional steps that need to be done to make the new `glue` catalog available? I have also tried applying the above `spark.sql.catalog...` changes using the existing catalog name `hive_metastore`, but that does not work either. Some more guidance on reading Iceberg tables with AWS Glue Data Catalog as metastore would be appreciated. The page you linked does not mention Iceberg at all. Thanks!

ideal_knee
New Contributor

Thank you, though that link also does not mention Iceberg at all.

I am able to see the Iceberg table in Databricks in the `hive_metastore` catalog and see its schema via `DESCRIBE`; however, if I try to actually read the data, I get `java.lang.InstantiationException`. I am able to read other Parquet tables from the `hive_metastore` catalog, which is using AWS Glue Data Catalog as the metastore, but I cannot read the Iceberg table. When I run `SHOW CATALOGS`, I see 4 catalogs (hive_metastore, main, samples, and system). No catalog with the name `glue` appears, even with the Spark config I shared previously.

VZLA
Databricks Employee

The details are better explained in the document you were already using: https://iceberg.apache.org/docs/latest/spark-configuration/#replacing-the-session-catalog. The previous URL was shared because I understood the issue to be that the "glue" catalog, which should have been created for Iceberg table support, was not being listed. The second part of the problem is actually reading the data from the Iceberg table, which comes down to having the right JAR file(s) present on the classpath.
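
For illustration only, replacing the built-in session catalog along the lines of that page might look like the sketch below. This is a sketch based on the linked Iceberg documentation rather than a verified Databricks setup: the `SparkSessionCatalog`, `GlueCatalog`, and `S3FileIO` class names and the `catalog-impl`, `warehouse`, and `io-impl` keys come from the Iceberg docs; the `iceberg-aws-bundle` artifact is the Iceberg-documented way to put the AWS classes on the classpath; and the 1.7.0 versions are an assumption to match the runtime JAR already in use:

```
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0,org.apache.iceberg:iceberg-aws-bundle:1.7.0
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.spark_catalog org.apache.iceberg.spark.SparkSessionCatalog
spark.sql.catalog.spark_catalog.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.spark_catalog.warehouse s3://<your-warehouse-path>
spark.sql.catalog.spark_catalog.io-impl org.apache.iceberg.aws.s3.S3FileIO
```

If this takes effect, queries would go through the existing session catalog (shown as `hive_metastore` in Databricks) rather than a separate `glue` catalog.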

At this point I believe it would be good to check your setup live through a support ticket. If that's not possible, please let us know and we'll continue this way; I'll see whether I can reproduce a similar setup and share the steps.
