Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Proposal: Switch to Zstd Compression for Parquet to Reduce S3 Costs

susmitsircar
New Contributor III

We are planning to change the Spark configuration for Parquet files to use zstd compression.

  • Configuration: spark.sql.parquet.compression.codec = zstd

This will only affect new data written by our Spark jobs. All existing data will remain compressed with Snappy and will be fully readable without any changes.
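A minimal configuration sketch of the proposed change (assuming a Databricks notebook where `spark` is the preconfigured session; the S3 path is illustrative, not our real bucket):

```python
# Per-session override in a notebook; for a cluster-wide rollout the same
# key/value pair would go into the cluster's Spark config instead.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# New writes now use zstd. Existing Snappy files remain readable because
# Parquet records the codec per column chunk in each file's footer.
spark.range(1000).write.mode("overwrite").parquet("s3://example-bucket/new-data/")
```

Because the codec is stored in the file metadata, readers do not need any configuration change; only writers are affected.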

Benefits

  1. Significant Cost Savings: zstd offers a much higher compression ratio than Snappy. This will directly reduce our S3 storage costs for all new data ingested and processed. 

  2. Proven Efficiency: This is a widely adopted industry practice for optimizing data storage costs. As detailed in this Uber Engineering blog post, the impact on storage can be substantial.

Technical Considerations & Rollout Plan

The primary consideration is ensuring all our Databricks Runtimes (DBRs) can read and write zstd-compressed files.

Dear Databricks engineers and community members: can you confirm whether all active DBRs (including end-of-support versions such as 7.1 LTS) officially support zstd?

9 REPLIES

ManojkMohan
Honored Contributor

Older Databricks Runtimes (v7.x, including 7.1 LTS):

The official Databricks Runtime support lifecycle documentation specifies supported and end-of-life runtimes.

zstd Parquet support is not available for Databricks Runtime 7.1 LTS and other 7.x versions; these versions only support codecs such as snappy, gzip, and lzo. Enabling zstd for Parquet in DBR 7.1 LTS will result in compatibility issues: jobs may fail to read or write such compressed files due to the absent support.

https://docs.databricks.com/aws/en/release-notes/runtime/

https://docs.databricks.com/aws/en/release-notes/runtime/databricks-runtime-ver

| DBR Version | zstd Support | Documentation |
|---|---|---|
| 7.1 LTS | No | https://docs.databricks.com/aws/en/release-notes/runtime/ |
| 8.0+ | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/14.3lts |
| 13.3 LTS+ | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/14.3lts |
| 15.2+ / 15.4 LTS | Yes | https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/zstd_compress |
| 16.4 LTS | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/ |

susmitsircar
New Contributor III

Thanks for the reply @ManojkMohan 

As far as I know, zstd has been supported since Apache Spark 3.0.0, so ideally any DBR >= 7.3 LTS (which ships Apache Spark 3.0.1) should work with zstd.

Is my understanding wrong here?
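For context, upstream Apache Spark gained zstd Parquet support in 3.0.0, so the DBR-to-Spark mapping can be checked mechanically. A small illustrative helper (the function name is mine, not a Databricks API); note this only reflects upstream Spark, not what a given Databricks Runtime officially supports:

```python
# Illustrative helper: does a given Apache Spark version carry upstream
# zstd Parquet support (added in Spark 3.0.0)? This says nothing about
# what a given Databricks Runtime *officially* supports.
def has_upstream_zstd(spark_version: str) -> bool:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return (major, minor) >= (3, 0)

print(has_upstream_zstd("3.0.1"))  # Spark shipped in DBR 7.3 LTS -> True
print(has_upstream_zstd("2.4.5"))  # pre-3.0 Spark -> False
```

The gap between "upstream Spark supports it" and "Databricks officially supports it on this runtime" is exactly what the rest of this thread is about.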

That understanding is not quite correct. While Apache Spark 3.0.0 introduced zstd compression support, Databricks Runtime (DBR) 7.3 LTS, which includes Spark 3.0.1, does not officially support using zstd compression for Parquet files on the Databricks platform. The official Databricks release notes and documentation indicate that native zstd Parquet compression support starts from DBR 8.0 and above. Using zstd compression on DBR 7.3 LTS can lead to compatibility issues such as job failures or unreadable files.

For official confirmation, see the Databricks Runtime release notes and support details here:
https://docs.databricks.com/aws/en/release-notes/runtime/

So, it is important to upgrade your DBR clusters to version 8.0 or later before adopting zstd compression for Parquet.

The way I read it:

The presence of zstd-jni in the Databricks Runtime 7.3 LTS release notes primarily indicates that the native JNI library for zstd compression is included in that runtime version. However, this does not equate to full official support for using zstd compression as the Parquet codec within the Databricks platform.

While Apache Spark 3.0.0 and above introduced zstd compression support, Databricks Runtime 7.3 LTS (which includes Spark 3.0.1) does not officially enable or support writing or reading Parquet files compressed with zstd. The Parquet compression codec support for zstd was formally introduced and supported starting from Databricks Runtime 8.0.

Therefore, despite the inclusion of the zstd-jni library in DBR 7.3 LTS (see release notes — https://docs.databricks.com/aws/en/archive/runtime-release-notes/7.3lts), you should not rely on DBR 7.3 LTS for production workloads involving zstd compressed Parquet files, as this can lead to compatibility issues or failures.

For official confirmation and compatibility details, consult the Databricks Runtime release notes:
https://docs.databricks.com/aws/en/release-notes/runtime/

susmitsircar
New Contributor III

zstd Parquet support is not available for Databricks Runtime 7.1 LTS and other 7.x versions; these versions only support codecs such as snappy, gzip, and lzo. Enabling zstd for Parquet in DBR 7.1 LTS will result in compatibility issues: jobs may fail to read or write such compressed files due to the absent support.

For official confirmation and compatibility details, consult the Databricks Runtime release notes:
https://docs.databricks.com/aws/en/release-notes/runtime/

I feel it's more of an LLM-generated response, as I don't see anything useful in the runtime release notes related to zstd.

Sorry about that; maybe I should have added screenshots in the earlier comment itself. I am inferring based on the links inside https://docs.databricks.com/aws/en/release-notes/runtime/

[screenshot: ManojkMohan_0-1759153820202.png]

and inside 16.4 I see:

[screenshot: ManojkMohan_1-1759153901578.png]

In summary:

Though you see zstd mentioned in the release notes:

[screenshot: ManojkMohan_2-1759154120655.png]

Official Parquet zstd compression support is recognized starting with DBR 8.0 in later product documentation; hence using zstd with DBR 7.x remains unofficial and may carry risk, especially for your use case of combining Parquet and zstd.

The most definitive check is to perform dedicated testing on DBR 7.3 LTS clusters by writing and reading Parquet files compressed with zstd. Watch for job errors, unreadable-file errors, or degraded performance. Hope this helps.
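The dedicated test described above could be sketched as follows in a notebook attached to a 7.3 LTS cluster (a hedged sketch: the scratch path and row count are illustrative, and `spark` is the notebook's preconfigured session):

```python
# Round-trip smoke test for zstd Parquet on a DBR 7.3 LTS cluster.
test_path = "dbfs:/tmp/zstd_smoke_test"  # hypothetical scratch location

df = spark.range(1_000_000)

# Write with zstd explicitly, overriding any session default for this write.
df.write.mode("overwrite").option("compression", "zstd").parquet(test_path)

# Read back and verify the data survived the round trip.
roundtrip = spark.read.parquet(test_path)
assert roundtrip.count() == 1_000_000
```

If the write or read fails, or counts disagree, that runtime cannot be trusted with zstd Parquet in production.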


susmitsircar
New Contributor III

Yes, my belief is that 7.3 LTS should support it as well; we will prove it with thorough testing.

Thanks for the discussion. Cheers
