Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Issue while reading external Iceberg table from GCS path using Spark SQL

Arvind007
New Contributor II
 
```python
df = spark.sql("select * from bqms_table;")
df.show()
```

ENV - DBRT 16.3 (includes Apache Spark 3.5.2, Scala 2.12)

```
Py4JJavaError: An error occurred while calling o471.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7) (10.0.0.17 executor driver): java.lang.UnsupportedOperationException: Byte-buffer read unsupported by com.databricks.common.filesystem.LokiGCSInputStream
    at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:160)
    at com.databricks.spark.metrics.FSInputStreamWithMetrics.$anonfun$read$1(FileSystemWithMetrics.scala:77)
    at com.databricks.spark.metrics.FSInputStreamWithMetrics.withTimeAndBytesReadMetric(FileSystemWithMetrics.scala:67)
    at com.databricks.spark.metrics.FSInputStreamWithMetrics.read(FileSystemWithMetrics.scala:77)
    at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:156)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.H2SeekableInputStream$H2Reader.read(H2SeekableInputStream.java:89)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:108)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.util.H2SeekableInputStream.readFully(H2SeekableInputStream.java:83)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:622)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:934)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.<init>(ParquetFileReader.java:925)
    at org.apache.iceberg.shaded.org.apache.parquet.hadoop.ParquetFileReader.open(ParquetFileReader.java:710)
    at org.apache.iceberg.parquet.ReadConf.newReader(ReadConf.java:194)
    at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:76)
    at org.apache.iceberg.parquet.VectorizedParquetReader.init(VectorizedParquetReader.java:90)
    at org.apache.iceberg.parquet.VectorizedParquetReader.iterator(VectorizedParquetReader.java:99)
    at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:116)
    at org.apache.iceberg.spark.source.BatchDataReader.open(BatchDataReader.java:43)
    at org.apache.iceberg.spark.source.BaseReader.next(BaseReader.java:134)
    at org.apache.spark.sql.execution.datasources.v2.PartitionIterator.hasNext(DataSourceRDD.scala:122)
    at org.apache.spark.sql.execution.datasources.v2.MetricsIterator.hasNext(DataSourceRDD.scala:160)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1(DataSourceRDD.scala:64)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.$anonfun$hasNext$1$adapted(DataSourceRDD.scala:64)
    at scala.Option.exists(Option.scala:376)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:64)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.advanceToNextIter(DataSourceRDD.scala:99)
    at org.apache.spark.sql.execution.datasources.v2.DataSourceRDD$$anon$1.hasNext(DataSourceRDD.scala:64)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:50)
```
1 ACCEPTED SOLUTION

Accepted Solutions

BigRoux
Databricks Employee

The error you're encountering is related to a compatibility issue between Databricks' GCS implementation and Apache Iceberg when trying to read Iceberg tables from Google Cloud Storage. The specific error is:

```
java.lang.UnsupportedOperationException: Byte-buffer read unsupported by com.databricks.common.filesystem.LokiGCSInputStream
```

This indicates that the Databricks GCS file system implementation (`LokiGCSInputStream`) doesn't support the byte-buffer read operations that Iceberg requires when reading Parquet files.

Potential Solutions

1. Use a Different FileIO Implementation

Configure Iceberg to use a FileIO implementation that bypasses the Databricks GCS filesystem wrapper, such as Iceberg's native GCSFileIO. Try setting the following configuration:

```python
spark.conf.set("spark.sql.catalog.your_catalog_name.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")
```

2. Update Catalog Configuration

Ensure your catalog is properly configured with the correct GCS credentials and implementation:

```python
# Configure the Iceberg catalog
spark.conf.set("spark.sql.catalog.your_catalog_name", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.your_catalog_name.type", "hadoop")
spark.conf.set("spark.sql.catalog.your_catalog_name.warehouse", "gs://your-bucket/path")
spark.conf.set("spark.sql.catalog.your_catalog_name.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")
```
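
As a quick sanity check, you can confirm these settings are visible in the active session before retrying the read. This is a minimal sketch; `your_catalog_name` is a placeholder for whatever catalog name you actually use:

```python
# Print the Iceberg catalog settings currently in effect (None means the key is unset)
for key in [
    "spark.sql.catalog.your_catalog_name",
    "spark.sql.catalog.your_catalog_name.type",
    "spark.sql.catalog.your_catalog_name.warehouse",
    "spark.sql.catalog.your_catalog_name.io-impl",
]:
    print(key, "=", spark.conf.get(key, None))
```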

3. Check Iceberg Version Compatibility

The issue might be related to compatibility between Iceberg 1.5.1 and Databricks Runtime 16.3. Try using a different Iceberg version that's known to work with Databricks, such as 1.4.2:

```python
# Include in your Spark configuration (note: spark.jars.packages only takes effect at session startup, so set it in the cluster/session config rather than at runtime)
spark.conf.set("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2,org.apache.iceberg:iceberg-gcp-bundle:1.4.2")
```

4. Use Absolute Paths

Iceberg requires absolute paths to locate metadata files and data files. Make sure you're using the full GCS path:

```python
# Read the table by its absolute GCS path instead of a table name reference
df = spark.read.format("iceberg").load("gs://your-bucket/path/to/table")
```

5. Consider Using Unity Catalog

If possible, consider using Databricks Unity Catalog with Iceberg reads enabled, which provides better integration:

```sql
CREATE TABLE T (c1 INT) TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```
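
If you go this route, the table above is queried from Databricks like any ordinary table, and UniForm generates the Iceberg metadata that external Iceberg readers consume. A minimal read-back, assuming the table was created in a hypothetical `main.default` schema:

```python
# Read the UniForm-enabled Delta table back; catalog and schema names are placeholders
spark.sql("SELECT * FROM main.default.T").show()
```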

This is a known issue with Iceberg and certain file system implementations that don't support byte-buffer reads. The error occurs during the reading of Parquet file footers, which Iceberg uses to build its metadata model.
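
Putting the workaround together, a minimal notebook cell might look like the sketch below. The catalog name (`bqms_catalog`), namespace, and GCS paths are placeholders to adapt to your environment, and it assumes the Iceberg Spark runtime and GCP bundle jars are already available on the cluster:

```python
# Sketch: register an Iceberg catalog that points at the GCS warehouse and uses Iceberg's native GCSFileIO
spark.conf.set("spark.sql.catalog.bqms_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.bqms_catalog.type", "hadoop")
spark.conf.set("spark.sql.catalog.bqms_catalog.warehouse", "gs://your-bucket/warehouse")
spark.conf.set("spark.sql.catalog.bqms_catalog.io-impl", "org.apache.iceberg.gcp.gcs.GCSFileIO")

# Retry the original query through the configured catalog (namespace and table name are placeholders)
spark.sql("SELECT * FROM bqms_catalog.your_db.bqms_table").show()

# Alternatively, read the table directly by its absolute GCS path
spark.read.format("iceberg").load("gs://your-bucket/path/to/table").show()
```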


3 REPLIES


Arvind007
New Contributor II

A similar issue exists for Azure as well: https://github.com/apache/iceberg/issues/10808#issuecomment-2263673628

Can this be fixed at the Databricks level?

Arvind007
New Contributor II

 

(Screenshots attached: Arvind007_0-1743851283272.png, Arvind007_1-1743851403642.png, Arvind007_2-1743851437384.png)

I tried the suggested solutions, but the issue still persists. I'd appreciate it if Databricks could resolve this soon for better integration between GCP and Databricks.
