Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get:
INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI
Example code:
df = (spark.read.format("parquet")
      .option("compression", "brotli")
      .load("s3://<bucket>/<path>/<file>.parquet"))
df.write.saveAsTable("tmp_test")
I have a large amount of data stored with this compression, so switching right now would be difficult. It looks like Koalas supports it, or I could manually ingest it by spinning up my own Spark session (roughly as sketched below), but that would defeat the point of having Databricks / Delta Lake / Autoloader. Any suggestions on a workaround?
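For reference, the manual fallback I'd like to avoid would look roughly like this; the bucket, path, and table name are placeholders, and it assumes a plain Spark session with S3 credentials configured and a Brotli codec on the classpath:

from pyspark.sql import SparkSession

# Plain Spark session, bypassing the Databricks caching Parquet reader.
spark = (SparkSession.builder
         .appName("brotli-ingest")
         .getOrCreate())

# Read the Brotli-compressed Parquet files directly.
df = spark.read.parquet("s3://<bucket>/<path>/")

# Rewrite with a codec the rest of the pipeline handles (e.g. snappy)
# and register the result as a table.
(df.write
   .option("compression", "snappy")
   .mode("overwrite")
   .saveAsTable("tmp_test"))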
Edit: here is more of the stack trace:
Caused by: java.lang.RuntimeException: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI
at com.databricks.sql.io.caching.NativePageWriter$.create(Native Method)
at com.databricks.sql.io.caching.DiskCache$PageWriter.<init>(DiskCache.scala:318)
at com.databricks.sql.io.parquet.CachingPageReadStore$UnifiedCacheColumn.populate(CachingPageReadStore.java:1183)
at com.databricks.sql.io.parquet.CachingPageReadStore$UnifiedCacheColumn.lambda$getPageReader$0(CachingPageReadStore.java:1177)
at com.databricks.sql.io.caching.NativeDiskCache$.get(Native Method)
at com.databricks.sql.io.caching.DiskCache.get(DiskCache.scala:515)
at com.databricks.sql.io.parquet.CachingPageReadStore$UnifiedCacheColumn.getPageReader(CachingPageReadStore.java:1178)
at com.databricks.sql.io.parquet.CachingPageReadStore.getPageReader(CachingPageReadStore.java:1012)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.checkEndOfRowGroup(DatabricksVectorizedParquetRecordReader.java:741)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:603)