01-31-2023 05:31 PM
Spark 3.3.1 supports the brotli compression codec, but when I use it to read parquet files from S3, I get:
INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI
Example code:
df = (spark.read.format("parquet")
      .option("compression", "brotli")
      .load("s3://<bucket>/<path>/<file>.parquet"))
df.write.saveAsTable("tmp_test")
I have a large amount of data stored with this compression, so switching right now would be difficult. It looks like Koalas supports it, or I could manually ingest it by spinning up my own Spark session, but that would defeat the point of having Databricks / Delta Lake / Autoloader. Any suggestions for a workaround?
edit:
More output:
Caused by: java.lang.RuntimeException: INVALID_ARGUMENT: Unsupported codec for Parquet page: BROTLI
at com.databricks.sql.io.caching.NativePageWriter$.create(Native Method)
at com.databricks.sql.io.caching.DiskCache$PageWriter.<init>(DiskCache.scala:318)
at com.databricks.sql.io.parquet.CachingPageReadStore$UnifiedCacheColumn.populate(CachingPageReadStore.java:1183)
at com.databricks.sql.io.parquet.CachingPageReadStore$UnifiedCacheColumn.lambda$getPageReader$0(CachingPageReadStore.java:1177)
at com.databricks.sql.io.caching.NativeDiskCache$.get(Native Method)
at com.databricks.sql.io.caching.DiskCache.get(DiskCache.scala:515)
at com.databricks.sql.io.parquet.CachingPageReadStore$UnifiedCacheColumn.getPageReader(CachingPageReadStore.java:1178)
at com.databricks.sql.io.parquet.CachingPageReadStore.getPageReader(CachingPageReadStore.java:1012)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.checkEndOfRowGroup(DatabricksVectorizedParquetRecordReader.java:741)
at com.databricks.sql.io.parquet.DatabricksVectorizedParquetRecordReader.nextBatch(DatabricksVectorizedParquetRecordReader.java:603)
- Labels: Compression, DeltaLake, Parquet, Parquet files
Accepted Solutions
02-01-2023 01:48 PM
Given the new information I appended, I looked into the Delta caching and I can disable it:
.option("spark.databricks.io.cache.enabled", False)
This works as a workaround while I read these files in and save them locally in DBFS, but does it have performance repercussions? I'm only doing this to ingest files from S3 that were uploaded by an external process. I'm worried that it might increase the number of reads from S3 and, with it, ingestion costs.
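For anyone who wants to limit the change to the ingestion step only, here is a minimal sketch of the workaround above. It assumes the flag is set at the session level with spark.conf.set rather than as a reader option, and that `spark` is the session Databricks provides in a notebook. The helper names `disk_cache_disabled` and `ingest_brotli_parquet` are made up for illustration, and the S3 path is the placeholder from the question.

```python
from contextlib import contextmanager

# Databricks disk cache switch mentioned in the post above.
CACHE_KEY = "spark.databricks.io.cache.enabled"


@contextmanager
def disk_cache_disabled(spark):
    """Turn the disk cache off for the duration of the block (hypothetical helper)."""
    previous = spark.conf.get(CACHE_KEY, None)  # None if the key was never set
    spark.conf.set(CACHE_KEY, "false")
    try:
        yield spark
    finally:
        # Put the session back the way it was found.
        if previous is None:
            spark.conf.unset(CACHE_KEY)
        else:
            spark.conf.set(CACHE_KEY, previous)


def ingest_brotli_parquet(spark, source_path, target_table):
    """Read Parquet files and save them as a table with the disk cache off."""
    with disk_cache_disabled(spark):
        df = spark.read.format("parquet").load(source_path)
        # saveAsTable is an action, so the read runs inside the block.
        df.write.saveAsTable(target_table)


# Usage in a notebook, where `spark` already exists:
# ingest_brotli_parquet(spark, "s3://<bucket>/<path>/<file>.parquet", "tmp_test")
```

Restoring the previous value afterwards means other queries in the same session keep using the cache, so any extra S3 reads should be limited to the one-off ingestion.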
01-31-2023 09:33 PM
Hi, could you please check if this helps: https://spark.apache.org/docs/2.4.3/sql-data-sources-parquet.html
Also, you can refer to https://community.databricks.com/s/question/0D53f00001HKHSsCAP/how-can-i-change-the-parquet-compress...
02-01-2023 10:03 AM
Right. In the description there, it says the precedence is 'compression', followed by 'parquet.compression', followed by this option. As you can see in the code above, I am using 'compression', but I did test with this option as well. Same error.
I believe this is an issue specific to Databricks' layer over Spark / Delta tables, most likely that they have a codec validation step and didn't add brotli to it, as its addition to Spark is 'more recent'.
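To make the precedence from the linked Spark page concrete, here is a rough sketch in plain Python. It assumes "this option" is spark.sql.parquet.compression.codec, as on that page. The function `resolve_parquet_codec` is a made-up illustration, not a Spark API. As far as I know, these settings choose the codec when writing; on read the codec comes from the Parquet file metadata, which fits with the error being the same whichever option is set.

```python
def resolve_parquet_codec(write_options, session_conf):
    """Pick the Parquet write codec using the documented precedence.

    Hypothetical helper: both arguments are plain dicts standing in for
    the per-write options and the session configuration.
    """
    # 1. 'compression', then 2. 'parquet.compression' from the write options.
    for key in ("compression", "parquet.compression"):
        if key in write_options:
            return write_options[key].lower()
    # 3. Fall back to the session-wide setting (snappy is Spark's default).
    return session_conf.get("spark.sql.parquet.compression.codec", "snappy").lower()


print(resolve_parquet_codec(
    {"compression": "brotli", "parquet.compression": "gzip"},
    {"spark.sql.parquet.compression.codec": "zstd"},
))  # prints "brotli"
```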
11-18-2024 04:38 PM
Hi, how/where do I install 'BrotliCodec' in order to use brotli compression?