Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Proposal: Switch to Zstd Compression for Parquet to Reduce S3 Costs

susmitsircar
New Contributor III

We are planning to change the Spark configuration for Parquet files to use zstd compression.

  • Configuration: spark.sql.parquet.compression.codec = zstd

This will only affect new data written by our Spark jobs. All existing data will remain compressed with Snappy and will be fully readable without any changes.
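A minimal configuration sketch of the proposed change (assuming a Databricks notebook where `spark` is the preconfigured session; the S3 path is illustrative, not our real bucket):

```python
# Per-session override in a notebook; for a cluster-wide rollout the same
# key/value pair would go into the cluster's Spark config instead.
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# New writes now use zstd. Existing Snappy files remain readable because
# Parquet records the codec per column chunk in each file's footer.
spark.range(1000).write.mode("overwrite").parquet("s3://example-bucket/new-data/")
```

Because the codec is stored in the file metadata, readers do not need any configuration change; only writers are affected.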

Benefits

  1. Significant Cost Savings: zstd offers a much higher compression ratio than Snappy. This will directly reduce our S3 storage costs for all new data ingested and processed. 

  2. Proven Efficiency: This is a widely adopted industry practice for optimizing data storage costs. As detailed in this Uber Engineering blog post, the impact on storage can be substantial.

Technical Considerations & Rollout Plan

The primary consideration is ensuring all our Databricks Runtimes (DBRs) can read and write zstd-compressed files.

Dear Databricks engineers and community members: can you confirm whether all active DBRs (including end-of-support versions such as 7.1 LTS) officially support zstd?

9 REPLIES

ManojkMohan
Honored Contributor

Older Databricks Runtimes (v7.x, including 7.1 LTS):

The official Databricks Runtime support lifecycle documentation specifies supported and end-of-life runtimes.

zstd Parquet support is not available for Databricks Runtime 7.1 LTS and other 7.x versions; these versions only support codecs such as snappy, gzip, and lzo. Enabling zstd for Parquet in DBR 7.1 LTS will result in compatibility issues: jobs may fail to read or write such compressed files due to the absent support.

https://docs.databricks.com/aws/en/release-notes/runtime/

https://docs.databricks.com/aws/en/release-notes/runtime/databricks-runtime-ver

| DBR Version | zstd Support | Documentation |
|---|---|---|
| 7.1 LTS | No | https://docs.databricks.com/aws/en/release-notes/runtime/ |
| 8.0+ | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/14.3lts |
| 13.3 LTS+ | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/14.3lts |
| 15.2+ / 15.4 LTS | Yes | https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/functions/zstd_compress |
| 16.4 LTS | Yes | https://docs.databricks.com/aws/en/release-notes/runtime/ |

susmitsircar
New Contributor III

Thanks for the reply @ManojkMohan 

As far as I know, zstd has been supported since Apache Spark 3.0.0, so ideally any DBR >= 7.3 LTS (which ships Apache Spark 3.0.1) should work with zstd.

Is my understanding wrong here?
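For context, upstream Apache Spark gained zstd Parquet support in 3.0.0, so the DBR-to-Spark mapping can be checked mechanically. A small illustrative helper (the function name is mine, not a Databricks API); note this only reflects upstream Spark, not what a given Databricks Runtime officially supports:

```python
# Illustrative helper: does a given Apache Spark version carry upstream
# zstd Parquet support (added in Spark 3.0.0)? This says nothing about
# what a given Databricks Runtime *officially* supports.
def has_upstream_zstd(spark_version: str) -> bool:
    major, minor = (int(p) for p in spark_version.split(".")[:2])
    return (major, minor) >= (3, 0)

print(has_upstream_zstd("3.0.1"))  # Spark shipped in DBR 7.3 LTS -> True
print(has_upstream_zstd("2.4.5"))  # pre-3.0 Spark -> False
```

The gap between "upstream Spark supports it" and "Databricks officially supports it on this runtime" is exactly what the rest of this thread is about.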

That understanding is not quite correct. While Apache Spark 3.0.0 introduced zstd compression support, Databricks Runtime (DBR) 7.3 LTS, which includes Spark 3.0.1, does not officially support using zstd compression for Parquet files on the Databricks platform. The official Databricks release notes and documentation indicate that native zstd Parquet compression support starts from DBR 8.0 and above. Using zstd compression on DBR 7.3 LTS can lead to compatibility issues such as job failures or unreadable files.

For official confirmation, see the Databricks Runtime release notes and support details here:
https://docs.databricks.com/aws/en/release-notes/runtime/

So, it is important to upgrade your DBR clusters to version 8.0 or later before adopting zstd compression for Parquet.

The way I read it:

The presence of zstd-jni in the Databricks Runtime 7.3 LTS release notes primarily indicates that the native JNI library for zstd compression is included in that runtime version. However, this does not equate to full official support for using zstd compression as the Parquet codec within the Databricks platform.

While Apache Spark 3.0.0 and above introduced zstd compression support, Databricks Runtime 7.3 LTS (which includes Spark 3.0.1) does not officially enable or support writing or reading Parquet files compressed with zstd. The Parquet compression codec support for zstd was formally introduced and supported starting from Databricks Runtime 8.0.

Therefore, despite the inclusion of the zstd-jni library in DBR 7.3 LTS (see release notes — https://docs.databricks.com/aws/en/archive/runtime-release-notes/7.3lts), you should not rely on DBR 7.3 LTS for production workloads involving zstd compressed Parquet files, as this can lead to compatibility issues or failures.

For official confirmation and compatibility details, consult the Databricks Runtime release notes:
https://docs.databricks.com/aws/en/release-notes/runtime/

susmitsircar
New Contributor III

zstd Parquet support is not available for Databricks Runtime 7.1 LTS and other 7.x versions; these versions only support codecs such as snappy, gzip, and lzo. Enabling zstd for Parquet in DBR 7.1 LTS will result in compatibility issues: jobs may fail to read or write such compressed files due to the absent support.

For official confirmation and compatibility details, consult the Databricks Runtime release notes:
https://docs.databricks.com/aws/en/release-notes/runtime/

I feel it's more of an LLM-generated response, as I don't see anything useful in the runtime release notes related to zstd.

Sorry about that; maybe I should have added screenshots in the earlier comment itself. I am inferring based on the links inside https://docs.databricks.com/aws/en/release-notes/runtime/

[screenshot: ManojkMohan_0-1759153820202.png]

and inside 16.4 I see:

[screenshot: ManojkMohan_1-1759153901578.png]

In summary:

Though you see zstd mentioned in the release notes:

[screenshot: ManojkMohan_2-1759154120655.png]

Official Parquet zstd compression support is recognized starting with DBR 8.0 in later product documentation; hence using zstd with DBR 7.x remains unofficial and may carry risk, especially for your use case of combining Parquet and zstd.

The most definitive check is to perform dedicated testing on DBR 7.3 LTS clusters by writing and reading Parquet files compressed with zstd. Watch for job errors, unreadable-file errors, or degraded performance. Hope this helps.
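The dedicated test described above could be sketched as follows in a notebook attached to a 7.3 LTS cluster (a hedged sketch: the scratch path and row count are illustrative, and `spark` is the notebook's preconfigured session):

```python
# Round-trip smoke test for zstd Parquet on a DBR 7.3 LTS cluster.
test_path = "dbfs:/tmp/zstd_smoke_test"  # hypothetical scratch location

df = spark.range(1_000_000)

# Write with zstd explicitly, overriding any session default for this write.
df.write.mode("overwrite").option("compression", "zstd").parquet(test_path)

# Read back and verify the data survived the round trip.
roundtrip = spark.read.parquet(test_path)
assert roundtrip.count() == 1_000_000
```

If the write or read fails, or counts disagree, that runtime cannot be trusted with zstd Parquet in production.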


susmitsircar
New Contributor III

Yes, my belief is that 7.3 LTS should support it as well; we will prove it with thorough testing.

Thanks for the discussion. Cheers
