4 weeks ago
We are considering changing the Spark configuration for Parquet files to use zstd compression.
Configuration: spark.sql.parquet.compression.codec = zstd
This will only affect new data written by our Spark jobs. All existing data will remain compressed with Snappy and will be fully readable without any changes.
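For reference, a minimal PySpark sketch of the change described above. The config key is the one quoted in this post; the helper names are illustrative and not part of any library, and `spark` is assumed to be an existing SparkSession (as provided in a Databricks notebook).

```python
CODEC_KEY = "spark.sql.parquet.compression.codec"


def use_zstd_for_new_writes(spark):
    """Make zstd the session default for Parquet writes.

    Returns the previous codec so the caller can restore it.
    Files that already exist are not rewritten by this setting.
    """
    # Spark's own default for this key is snappy, used here as the fallback.
    previous = spark.conf.get(CODEC_KEY, "snappy")
    spark.conf.set(CODEC_KEY, "zstd")
    return previous


def write_zstd(df, path):
    """Write a single DataFrame as zstd Parquet without changing the session default."""
    # The per-write "compression" option takes precedence over the session setting.
    df.write.option("compression", "zstd").parquet(path)
```

The per-write variant is handy for a gradual rollout: one job can switch to zstd while everything else keeps writing Snappy.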
Significant Cost Savings: zstd offers a much higher compression ratio than Snappy. This will directly reduce our S3 storage costs for all new data ingested and processed.
Proven Efficiency: This is a widely adopted industry practice for optimizing data storage costs. As detailed in this Uber Engineering blog post, the impact on storage can be substantial.
The primary consideration is ensuring that all our Databricks Runtimes (DBRs) can read and write zstd-compressed files.
Dear Databricks engineers and community members, can it be confirmed that all active DBR versions (including end-of-support versions such as 7.1 LTS) officially support zstd?
4 weeks ago
Older Databricks Runtimes (v7.x, including 7.1 LTS):
The official Databricks Runtime support lifecycle documentation specifies supported and end-of-life runtimes.
zstd Parquet support is not available for Databricks Runtime 7.1 LTS and other 7.x versions; these versions only support codecs like snappy, gzip, and lzo. Enabling zstd for Parquet in DBR 7.1 LTS will result in compatibility issues: jobs may fail to read or write such files because the codec is not supported.
https://docs.databricks.com/aws/en/release-notes/runtime/
https://docs.databricks.com/aws/en/release-notes/runtime/databricks-runtime-ver
| DBR Version | zstd Support | Documentation |
| --- | --- | --- |
| 7.1 LTS | No | https://docs.databricks.com/aws/en/release-notes/runtime/ |
| 8.0+ | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/14.3lts |
| 13.3 LTS+ | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/14.3lts |
| 15.2+/15.4 LTS | Yes | https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/zstd_compress |
| 16.4 LTS | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/ |
4 weeks ago
Thanks for the reply, @ManojkMohan.
As far as I know, zstd has been supported since Spark 3.0.0, so any DBR >= 7.3 LTS (which uses Apache Spark 3.0.1) should work with zstd.
Is my understanding wrong here?
4 weeks ago
No, your understanding is not correct. While Apache Spark 3.0.0 introduced zstd compression support, Databricks Runtime (DBR) 7.3 LTS, which includes Spark 3.0.1, does not officially support using zstd compression for Parquet files on the Databricks platform. The official Databricks release notes and documentation clearly indicate that native zstd Parquet compression support starts from DBR 8.0 and above. Using zstd compression on DBR 7.3 LTS can lead to compatibility issues such as job failures or unreadable files.
For official confirmation, see the Databricks Runtime release notes and support details here:
https://docs.databricks.com/aws/en/release-notes/runtime/
So it is important to upgrade your DBR clusters to version 8.0 or later before adopting zstd compression for Parquet.
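If clusters on mixed runtime versions share the same job code, one option is to choose the codec from the runtime version instead of hard-coding it. The sketch below is only an illustration: `pick_parquet_codec` is a hypothetical helper, the default 8.0 threshold is simply the figure claimed in this reply (it is disputed elsewhere in this thread), and reading the version from the `DATABRICKS_RUNTIME_VERSION` environment variable is an assumption to verify on your own clusters.

```python
import os
import re


def pick_parquet_codec(runtime_version, min_zstd_version=(8, 0)):
    """Return "zstd" when the runtime meets the threshold, otherwise "snappy".

    `runtime_version` may look like "7.3" or "7.3.x-scala2.12";
    only the leading major.minor part is used.
    """
    match = re.match(r"(\d+)\.(\d+)", runtime_version or "")
    if match is None:
        # Unknown or missing version: fall back to the conservative codec.
        return "snappy"
    version = (int(match.group(1)), int(match.group(2)))
    return "zstd" if version >= min_zstd_version else "snappy"


# Assumption: Databricks clusters expose the runtime version in this variable.
codec = pick_parquet_codec(os.environ.get("DATABRICKS_RUNTIME_VERSION", ""))
```

Passing `min_zstd_version=(7, 3)` lowers the threshold if testing shows that 7.3 LTS handles zstd correctly.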
4 weeks ago
But I can see zstd-jni in the release notes:
https://docs.databricks.com/aws/en/archive/runtime-release-notes/7.3lts
4 weeks ago
The way I read it:
The presence of zstd-jni in the Databricks Runtime 7.3 LTS release notes primarily indicates that the native JNI library for zstd compression is included in that runtime version. However, this does not equate to full official support for using zstd compression as the Parquet codec within the Databricks platform.
While Apache Spark 3.0.0 and above introduced zstd compression support, Databricks Runtime 7.3 LTS (which includes Spark 3.0.1) does not officially enable or support writing or reading Parquet files compressed with zstd. The Parquet compression codec support for zstd was formally introduced and supported starting from Databricks Runtime 8.0.
Therefore, despite the inclusion of the zstd-jni library in DBR 7.3 LTS (see release notes — https://docs.databricks.com/aws/en/archive/runtime-release-notes/7.3lts), you should not rely on DBR 7.3 LTS for production workloads involving zstd compressed Parquet files, as this can lead to compatibility issues or failures.
For official confirmation and compatibility details, consult the Databricks Runtime release notes:
https://docs.databricks.com/aws/en/release-notes/runtime/
4 weeks ago
zstd Parquet support is not available for Databricks Runtime 7.1 LTS and other 7.x versions; these versions only support codecs like snappy, gzip, and lzo. Enabling zstd for Parquet in DBR 7.1 LTS will result in compatibility issues—jobs may fail to read/write such compressed files due to absent support
For official confirmation and compatibility details, consult the Databricks Runtime release notes:
https://docs.databricks.com/aws/en/release-notes/runtime/
This reads more like an LLM-generated response to me, as I don't see anything related to zstd in the runtime release notes.
4 weeks ago - last edited 4 weeks ago
Sorry about that; I should have added screenshots to the earlier comment. I am inferring this from the pages linked from https://docs.databricks.com/aws/en/release-notes/runtime/, including the 16.4 release notes.
In summary: although ZSTD appears in the release notes, official Parquet zstd compression support is only recognized starting with DBR 8.0 in later product documentation. Using zstd with DBR 7.x therefore remains unofficial and may carry risk, especially for your use case of Parquet with zstd.
The most definitive check is to run dedicated tests on DBR 7.3 LTS clusters by writing and reading Parquet files compressed with zstd. Watch for job errors, unreadable-file errors, or degraded performance. Hope this helps.
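The write-and-read test suggested here could be sketched roughly as follows. This is a minimal illustration, not an official validation procedure: `zstd_roundtrip_ok` is a hypothetical helper, `spark` is assumed to be the cluster's existing SparkSession, and the path is whatever scratch location you choose.

```python
def zstd_roundtrip_ok(spark, path, rows=1000):
    """Write `rows` rows as zstd Parquet to `path`, read them back, and compare."""
    expected = spark.range(rows)
    # Force zstd for this write only; the session default is left untouched.
    expected.write.mode("overwrite").option("compression", "zstd").parquet(path)
    actual = spark.read.parquet(path)
    # Rows that were written but not read back, and rows read back unexpectedly.
    missing = expected.subtract(actual).count()
    extra = actual.subtract(expected).count()
    return actual.count() == rows and missing == 0 and extra == 0
```

On a runtime without working zstd support, the failure is more likely to surface as an exception during the write or read than as a `False` result, so run it on each runtime version in use. It is also worth reading the same path from a cluster on a different DBR version than the one that wrote it.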
4 weeks ago
Yes, my belief is that 7.3 LTS should support it as well; we will verify this with thorough testing.
Thanks for the discussion. Cheers