<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Issue while reading external iceberg table from GCS path using spark SQL in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114586#M44876</link>
    <description>&lt;P&gt;Similar issue exists for Azure as well &lt;A href="https://github.com/apache/iceberg/issues/10808#issuecomment-2263673628" target="_blank"&gt;https://github.com/apache/iceberg/issues/10808#issuecomment-2263673628&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Can this be fixed at the Databricks level?&lt;/P&gt;</description>
    <pubDate>Sat, 05 Apr 2025 11:06:05 GMT</pubDate>
    <dc:creator>Arvind007</dc:creator>
    <dc:date>2025-04-05T11:06:05Z</dc:date>
    <item>
      <title>Issue while reading external iceberg table from GCS path using spark SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114406#M44810</link>
      <description>&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;LI-CODE lang="markup"&gt;df = spark.sql("select * from bqms_table;");
df.show();&lt;/LI-CODE&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;&lt;DIV&gt;ENV - DBRT&amp;nbsp;&lt;SPAN&gt;16.3 (includes Apache Spark 3.5.2, Scala 2.12)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;A target="_blank"&gt;org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.1&lt;/A&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;DIV class=""&gt;&lt;DIV class=""&gt;Py4JJavaError: An error occurred while calling o471.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (10.0.0.17 executor driver): java.lang.UnsupportedOperationException: Byte-buffer read unsupported by com.databricks.common.filesystem.LokiGCSInputStream at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:160) at com.databricks.spark.metrics.FSInputStreamWithMetrics.$anonfun$read$1(FileSystemWithMetrics.scala:77) at com.databricks.spark.metrics.FSInputStreamWithMetrics.withTimeAndBytesReadMetric(FileSystemWithMetrics.scala:67) at com.databricks.spark.metrics.FSInputStreamWithMetrics.read(FileSystemWithMetrics.scala:77) at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:156) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:89) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:108) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:83) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:622) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.&amp;lt;init&amp;gt;(ParquetFileReader.java:934) at 
org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.&amp;lt;init&amp;gt;(ParquetFileReader.java:925) at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710) at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194) at org.apache.iceberg.parquet.ReadConf.&amp;lt;init&amp;gt;(ReadConf.java:76) at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90) at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99) at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:116) at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:43) at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:134) at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:122) at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:160) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:64) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:64) at scala.Option.exists(Option.scala:376) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:64) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:99) at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:64) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;/DIV&gt;</description>
      <pubDate>Thu, 03 Apr 2025 14:10:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114406#M44810</guid>
      <dc:creator>Arvind007</dc:creator>
      <dc:date>2025-04-03T14:10:51Z</dc:date>
    </item>
    <item>
      <title>Re: Issue while reading external iceberg table from GCS path using spark SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114434#M44823</link>
      <description>&lt;P&gt;The error you're encountering is related to a compatibility issue between Databricks' GCS implementation and Apache Iceberg when trying to read Iceberg tables from Google Cloud Storage. The specific error is:&lt;/P&gt;
&lt;P&gt;```&lt;BR /&gt;java.lang.UnsupportedOperationException: Byte-buffer read unsupported by com.databricks.common.filesystem.LokiGCSInputStream&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;This indicates that the Databricks GCS file system implementation (`LokiGCSInputStream`) doesn't support the byte-buffer read operations that Iceberg requires when reading Parquet files.&lt;/P&gt;
&lt;P&gt;Potential Solutions&lt;/P&gt;
&lt;P&gt;1. Use a Different FileIO Implementation&lt;/P&gt;
&lt;P&gt;You need to configure Iceberg to use a different FileIO implementation that's compatible with Databricks' GCS integration. Try setting the following configuration:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;spark.conf.set("spark.sql.catalog.your_catalog_name.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;2. Update Catalog Configuration&lt;/P&gt;
&lt;P&gt;Ensure your catalog is properly configured with the correct GCS credentials and implementation:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;Configure Iceberg catalog&lt;BR /&gt;spark.conf.set("spark.sql.catalog.your_catalog_name", "org.apache.iceberg.spark.SparkCatalog")&lt;BR /&gt;spark.conf.set("spark.sql.catalog.your_catalog_name.type", "hadoop")&lt;BR /&gt;spark.conf.set("spark.sql.catalog.your_catalog_name.warehouse", "gs://your-bucket/path")&lt;BR /&gt;spark.conf.set("spark.sql.catalog.your_catalog_name.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;3. Check Iceberg Version Compatibility&lt;/P&gt;
&lt;P&gt;The issue might be related to compatibility between Iceberg 1.5.1 and Databricks Runtime 16.3. Try using a different Iceberg version that's known to work with Databricks, such as 1.4.2:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;Include in your spark configuration&lt;BR /&gt;spark.conf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2,org.apache.iceberg:iceberg-gcp-bundle:1.4.2")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;4. Use Absolute Paths&lt;/P&gt;
&lt;P&gt;Iceberg requires absolute paths to locate metadata files and data files. Make sure you're using the full GCS path:&lt;/P&gt;
&lt;P&gt;```python&lt;BR /&gt;# Instead of using a table name reference&lt;BR /&gt;df = spark.sql("SELECT * FROM gs://your-bucket/path/to/table")&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;5. Consider Using Unity Catalog&lt;/P&gt;
&lt;P&gt;If possible, consider using Databricks Unity Catalog with Iceberg reads enabled, which provides better integration:&lt;/P&gt;
&lt;P&gt;```sql&lt;BR /&gt;CREATE TABLE T(c1 INT) TBLPROPERTIES(&lt;BR /&gt;'delta.columnMapping.mode' = 'name',&lt;BR /&gt;'delta.enableIcebergCompatV2' = 'true',&lt;BR /&gt;'delta.universalFormat.enabledFormats' = 'iceberg'&lt;BR /&gt;);&lt;BR /&gt;```&lt;/P&gt;
&lt;P&gt;This is a known issue with Iceberg and certain file system implementations that don't support byte-buffer reads. The error occurs during the reading of Parquet file footers, which Iceberg uses to build its metadata model.&lt;/P&gt;</description>
      <pubDate>Thu, 03 Apr 2025 17:07:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114434#M44823</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2025-04-03T17:07:03Z</dc:date>
    </item>
    <item>
      <title>Re: Issue while reading external iceberg table from GCS path using spark SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114586#M44876</link>
      <description>&lt;P&gt;Similar issue exists for Azure as well &lt;A href="https://github.com/apache/iceberg/issues/10808#issuecomment-2263673628" target="_blank"&gt;https://github.com/apache/iceberg/issues/10808#issuecomment-2263673628&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Can this be fixed at the Databricks level?&lt;/P&gt;</description>
      <pubDate>Sat, 05 Apr 2025 11:06:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114586#M44876</guid>
      <dc:creator>Arvind007</dc:creator>
      <dc:date>2025-04-05T11:06:05Z</dc:date>
    </item>
    <item>
      <title>Re: Issue while reading external iceberg table from GCS path using spark SQL</title>
      <link>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114590#M44877</link>
      <description>&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Arvind007_1-1743851403642.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15830i0C86F37AB1B69C30/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Arvind007_1-1743851403642.png" alt="Arvind007_1-1743851403642.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Arvind007_0-1743851283272.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15829iE896F094585DE5B7/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Arvind007_0-1743851283272.png" alt="Arvind007_0-1743851283272.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Arvind007_2-1743851437384.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/15831i8F245C2D6322FDA0/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Arvind007_2-1743851437384.png" alt="Arvind007_2-1743851437384.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;I tried the given solutions but it seems the issue still persists. I would appreciate it if this can be resolved by Databricks soon, for better integration between GCP and Databricks.&lt;/P&gt;</description>
      <pubDate>Sat, 05 Apr 2025 11:12:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/issue-while-reading-external-iceberg-table-from-gcs-path-using/m-p/114590#M44877</guid>
      <dc:creator>Arvind007</dc:creator>
      <dc:date>2025-04-05T11:12:07Z</dc:date>
    </item>
  </channel>
</rss>

