Reading an Iceberg table with AWS Glue Data Catalog as metastore

ideal_knee — Thu, 05 Dec 2024 22:22:54 GMT

I have created an Iceberg table using AWS Glue; however, whenever I try to read it using a Databricks cluster, I get `java.lang.InstantiationException`. I have tried every combination of Spark configs for my Databricks compute cluster that I can think of based on Databricks, Dremio, AWS, and Iceberg documentation. Most recently I tried

```
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
```

I have also tried including various `spark.sql.catalog.hive_metastore...` configs as mentioned in the Iceberg docs, with the same result. Any guidance on the minimal Spark configs necessary (or other suggestions) to allow reading an Iceberg table with AWS Glue Data Catalog as metastore would be greatly appreciated. Thanks!

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

VZLA — Fri, 06 Dec 2024 16:20:55 GMT

To read an Iceberg table using AWS Glue Data Catalog as the metastore on a Databricks cluster, you need to configure Spark with the appropriate settings and ensure compatibility with the Iceberg runtime. Here's the example setup:

1. Enable the AWS Glue Catalog by setting `spark.databricks.hive.metastore.glueCatalog.enabled` to `true`.
2. Include the Iceberg runtime JAR that matches your Spark version. For Spark 3.5, use `org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0`.
3. Add the Iceberg extensions to Spark with `spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions`.
4. Set the following catalog configurations:
   - `spark.sql.catalog.glue=org.apache.iceberg.spark.SparkCatalog`
   - `spark.sql.catalog.glue.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog`
   - `spark.sql.catalog.glue.warehouse=s3://<your-warehouse-path>`
   - `spark.sql.catalog.glue.io-impl=org.apache.iceberg.aws.s3.S3FileIO`
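
The settings listed above can be collected in one place. The following is a minimal Python sketch, not an official Databricks or Iceberg utility, that renders them in the space-separated `key value` form used in the cluster Spark config earlier in this thread. The function names `glue_iceberg_conf` and `render_spark_conf` and the warehouse path `s3://example-bucket/warehouse` are hypothetical placeholders; the keys and values are the ones from the list.

```python
def glue_iceberg_conf(warehouse, catalog="glue"):
    """Return the settings from the list above as a dict.

    `warehouse` is a placeholder S3 path; `catalog` is the catalog name
    used in the `spark.sql.catalog.<name>` keys.
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.databricks.hive.metastore.glueCatalog.enabled": "true",
        "spark.jars.packages": "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0",
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def render_spark_conf(conf):
    """Render a dict as one `key value` pair per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())


if __name__ == "__main__":
    # Hypothetical warehouse path, for illustration only.
    print(render_spark_conf(glue_iceberg_conf("s3://example-bucket/warehouse")))
```

Keeping the catalog name as a parameter makes it easy to see which keys have to change together if a name other than `glue` is used.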

Once the cluster is configured, you can test it by running queries such as `SHOW TABLES IN glue.<database_name>` or `SELECT * FROM glue.<database_name>.<table_name>` to validate connectivity.
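
As a rough sketch of that check from a notebook, the two queries can be issued through `spark.sql`. The helper names `validation_queries` and `run_validation`, and the `LIMIT 10` added to keep the read small, are assumptions for illustration and not part of the original suggestion. The `spark` argument is expected to be an existing `SparkSession` (Databricks notebooks normally provide one as `spark`).

```python
def validation_queries(database, table, catalog="glue"):
    """Build the two smoke-test queries for a catalog, database and table."""
    return [
        f"SHOW TABLES IN {catalog}.{database}",
        # LIMIT 10 is an addition so the check does not scan the whole table.
        f"SELECT * FROM {catalog}.{database}.{table} LIMIT 10",
    ]


def run_validation(spark, database, table, catalog="glue"):
    """Run each query with spark.sql and return the results in order."""
    return [spark.sql(query) for query in validation_queries(database, table, catalog)]


# Example use in a notebook (database and table names are placeholders):
# for df in run_validation(spark, "my_database", "my_table"):
#     df.show()
```

If the first query fails, the catalog itself is not registered; if only the second fails, the catalog resolves but reading the table data does not.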

Make sure your Databricks cluster has the necessary AWS credentials and permissions for Glue and S3. This guidance is based on Databricks Documentation on AWS Glue and Iceberg, with specific references to Spark configurations for Iceberg compatibility. Let us know if you need additional help!

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

ideal_knee — Mon, 09 Dec 2024 22:31:17 GMT

Thanks for your quick response. I have added the `iceberg-spark-runtime-3.5_2.12-1.7.0.jar` from iceberg.apache.org as a library in my cluster (the runtime is "16.0 ML (includes Apache Spark 3.5.0, Scala 2.12)") and have the following for my Spark config:

spark.sql.extensions org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.glue.catalog-impl org.apache.iceberg.aws.glue.GlueCatalog
spark.databricks.hive.metastore.glueCatalog.enabled true
spark.sql.catalog.glue.type glue
spark.sql.catalog.glue.warehouse s3://<my-bucket>/<my-prefix>/
spark.jars.packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.7.0
spark.master local[*, 4]
spark.sql.catalog.glue.io-impl org.apache.iceberg.aws.s3.S3FileIO
spark.databricks.cluster.profile singleNode
spark.sql.catalog.glue org.apache.iceberg.spark.SparkCatalog

When I run the cluster and try `SHOW TABLES IN glue`, it doesn't find a catalog called `glue`. Are there additional steps that need to be done to make the new `glue` catalog available? I have also tried applying the above `spark.sql.catalog...` changes using the existing catalog name `hive_metastore`, but that does not work either. Some more guidance on reading Iceberg tables with AWS Glue Data Catalog as metastore would be appreciated. The page you linked does not mention Iceberg at all. Thanks!
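
One detail in the config above that may be worth ruling out: it sets both `spark.sql.catalog.glue.type` and `spark.sql.catalog.glue.catalog-impl`. The Iceberg Spark configuration docs, as I read them, describe `type` as a setting to leave unset when a custom implementation is supplied through `catalog-impl`. Whether this is related to the missing `glue` catalog is not established anywhere in this thread. The sketch below only parses `key value` lines and flags catalogs that set both keys; the function names are hypothetical.

```python
def parse_spark_conf(text):
    """Parse `key value` lines (one setting per line) into a dict."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only, so values such as "local[*, 4]" stay intact.
        key, _, value = line.partition(" ")
        conf[key] = value.strip()
    return conf


def catalogs_with_type_and_impl(conf):
    """Return catalog names that define both `.type` and `.catalog-impl`."""
    prefix = "spark.sql.catalog."
    suffix = ".type"
    hits = []
    for key in conf:
        if key.startswith(prefix) and key.endswith(suffix):
            name = key[len(prefix):-len(suffix)]
            if f"{prefix}{name}.catalog-impl" in conf:
                hits.append(name)
    return sorted(hits)
```

Run against the config above, this would report `glue`, which suggests trying the config with only one of the two keys set.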

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

VZLA — Tue, 10 Dec 2024 08:54:29 GMT

Apologies about the link; the legacy URL should be https://docs.databricks.com/ja/archive/external-metastores/aws-glue-metastore.html#use-aws-glue-data-catalog-as-a-metastore-legacy.

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

ideal_knee — Tue, 10 Dec 2024 21:19:06 GMT

Thank you, though that link also does not mention Iceberg at all.

I am able to see the Iceberg table in Databricks in the `hive_metastore` catalog and see its schema via `DESCRIBE`; however, if I try to actually read the data, I get `java.lang.InstantiationException`. I can read other Parquet tables from the `hive_metastore` catalog, which uses AWS Glue Data Catalog as the metastore, but I cannot read the Iceberg table. When I run `SHOW CATALOGS`, I see 4 catalogs (`hive_metastore`, `main`, `samples`, and `system`). No catalog with the name `glue` appears, even with the Spark config I shared previously.
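
One way to narrow this down is to confirm that the `spark.sql.catalog.glue...` settings actually reached the running session. The sketch below is a diagnostic under assumptions, not a fix: `missing_conf_keys` is a hypothetical helper, `EXPECTED_KEYS` simply repeats the catalog keys from this thread, and it relies on `spark.conf.get(key, default)` returning the default when a key is unset.

```python
# Catalog keys taken from the Spark config discussed in this thread.
EXPECTED_KEYS = [
    "spark.sql.extensions",
    "spark.sql.catalog.glue",
    "spark.sql.catalog.glue.catalog-impl",
    "spark.sql.catalog.glue.warehouse",
    "spark.sql.catalog.glue.io-impl",
]


def missing_conf_keys(spark, expected_keys=EXPECTED_KEYS):
    """Return the keys that are not set in the session's runtime config."""
    return [key for key in expected_keys if spark.conf.get(key, None) is None]


# Example use in a notebook, where `spark` is the existing SparkSession:
# print(missing_conf_keys(spark))
```

An empty result would mean the settings are visible to the session and the problem lies elsewhere; a non-empty one would point at the cluster config not being applied.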

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

VZLA — Wed, 11 Dec 2024 09:50:47 GMT

The details are better explained in the document you were initially using: https://iceberg.apache.org/docs/latest/spark-configuration/#replacing-the-session-catalog. I shared the previous URL on the understanding that the issue was with listing the `glue` catalog, which should have been created for Iceberg table support. The second part of the problem, actually reading data from the Iceberg table, is more likely a matter of having the right JAR file(s) present on the classpath.

At this point I believe it would be good to check your setup live through a support ticket. If that's not possible, please let us know and we'll continue here; I'll check whether I can build a similar setup and share the steps.

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

ideal_knee — Fri, 24 Jan 2025 18:31:48 GMT

In case someone happens upon this in the future, I ended up using Unity Catalog with Hive metastore federation for Glue. The Iceberg support is currently "coming soon in Public Preview."

Re: Reading an Iceberg table with AWS Glue Data Catalog as metastore

ozzieg_odaseva — Fri, 08 May 2026 23:12:23 GMT

I can't get the hive metastore working without getting this error non-stop: "Failed to instantiate org.apache.hadoop.mapred.FileInputFormat". I have all the previews turned on. Has anyone resolved this issue? I am trying to query using a Serverless SQL warehouse.