Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Behavior of Zstd Compression for Delta Tables Across Different Databricks Runtime Versions

pooja_bhumandla
New Contributor III

Hi all,

For ZSTD compression, as per the documentation, any table created with DBR 16.0 or newer (or Apache Spark 3.5+) uses Zstd as the default compression codec instead of Snappy.

I explicitly set the table property to Zstd:

spark.sql("""
ALTER TABLE my_table
SET TBLPROPERTIES ('delta.compression.codec' = 'zstd')
""")

I also ran a full optimize on the table:

OPTIMIZE my_table FULL

After the optimization, the data files are indeed compressed using Zstd.

My question is about future writes:

If this table is later written to from a cluster running DBR 15.4 (or any runtime prior to 16.0), will the new output files still use Zstd (because of the table property), or will they revert to Snappy (because DBR < 16.0)?
I’d appreciate any clarification or insights on how Delta handles compression across different runtimes.

Thanks!

3 REPLIES

JAHNAVI
Databricks Employee

@pooja_bhumandla 

New files written by DBR 15.4 (or any pre‑16.0 runtime) will still use Zstd as long as the table property delta.compression.codec = 'zstd' remains set on the table.

When we explicitly run:

ALTER TABLE my_table
SET TBLPROPERTIES ('delta.compression.codec' = 'zstd');

any runtime that understands this property will write new Parquet files in Zstd for that table, regardless of its own default compression.


Jahnavi N

@JAHNAVI 

Thanks for the clarification.

Just to make sure I’m understanding this correctly for new table creation:
If a Delta table is created on DBR 15.4 with the compression property explicitly set, for example:

CREATE TABLE my_table (
...
)
USING DELTA
TBLPROPERTIES ('delta.compression.codec' = 'zstd');


Will the initial data files written during table creation use Zstd because of the table property, or does the DBR 15.4 runtime default (Snappy) still apply at creation time?
I’m specifically asking about the codec used for the data files created as part of the initial table creation.

Thanks again for your help.

SteveOstrowski
Databricks Employee

Hi @pooja_bhumandla,

The short answer is: yes, the explicit table property will be respected by writes from older runtimes. Here is why.

HOW DELTA COMPRESSION CODEC SELECTION WORKS

When Spark writes Parquet files for a Delta table, the compression codec is determined using the following priority order:

1. Per-write option: A one-off .write.option("compression", "...") on a DataFrame write takes the highest precedence.
2. Table property: The delta.parquet.compression.codec table property (set via ALTER TABLE ... SET TBLPROPERTIES) overrides the cluster or session default.
3. Session/cluster default: The Spark config spark.sql.parquet.compression.codec is used when neither of the above is set. On DBR 16.0+ this defaults to zstd. On DBR 15.4 and earlier it defaults to snappy.
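The precedence above can be illustrated with a pure-Python sketch. This is only a model of the rule as described, not Spark's actual implementation, and the function name is made up for illustration:

```python
def resolve_parquet_codec(write_option=None, table_property=None,
                          session_default="snappy"):
    """Sketch of the codec precedence described above:
    per-write option > Delta table property > session/cluster default.
    Illustrative only; this is not Spark's actual code."""
    if write_option:
        return write_option
    if table_property:
        return table_property
    return session_default

# A DBR 15.4 cluster (snappy session default) writing to a table whose
# property is set to zstd still produces zstd files:
print(resolve_parquet_codec(table_property="zstd"))  # zstd

# A per-write option, if someone sets one, would win over the property:
print(resolve_parquet_codec(write_option="snappy", table_property="zstd"))  # snappy
```

This is why the runtime version of the writing cluster does not matter: the table property sits above the session default in the lookup order.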

Because you explicitly set the compression codec as a table property, any standard write or OPTIMIZE operation against that table will read the table property from the Delta transaction log and use zstd, regardless of which Databricks Runtime version the writing cluster is running. The table property lives in the Delta log metadata, and all runtimes (including pre-16.0) can read it.

WHAT TO WATCH OUT FOR

1. Property naming: In DBR 16.0+ the documented table property name is delta.parquet.compression.codec. The older style delta.compression.codec is also recognized. Confirm that the property you set is actually persisted by running:

SHOW TBLPROPERTIES my_table

Look for a key like delta.parquet.compression.codec or delta.compression.codec with value zstd.

2. Per-write overrides: If any job writing to the table uses .write.option("compression", "snappy") explicitly in DataFrame API code, that will override the table property for that specific write. Make sure no upstream jobs are doing this.

3. Reading mixed-codec files: Delta Lake (and Parquet in general) stores the codec in each file's footer metadata. Readers do not need any special configuration to read files with mixed codecs in the same table. So even if some files were originally written with snappy and you later optimized them to zstd, reads will work correctly across all runtimes.

4. OPTIMIZE FULL: Running OPTIMIZE my_table FULL rewrites all data files, so after that completes, every file in the table will use zstd (assuming the table property is set). Any future incremental writes or OPTIMIZE runs will also use zstd.
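To make point 4 concrete, here is a toy model of what OPTIMIZE FULL does to the per-file codecs. The file names and the dict-based "codec map" are hypothetical; in reality each Parquet file records its codec in its own footer:

```python
# Hypothetical per-file codec map for a table with mixed history:
files = {"part-0001": "snappy", "part-0002": "snappy", "part-0003": "zstd"}

def optimize_full(data_files, table_codec):
    """OPTIMIZE ... FULL rewrites every data file, so afterwards every
    file in the table uses the codec configured on the table."""
    return {name: table_codec for name in data_files}

rewritten = optimize_full(files, "zstd")
print(sorted(set(rewritten.values())))  # ['zstd']
```

Incremental writes after the rewrite also resolve to zstd via the table property, so the table stays single-codec from that point on.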

VERIFYING THE CODEC IN USE

You can confirm the compression codec of individual Parquet files by inspecting the file metadata. One approach is to use the file listing in the Delta log:

DESCRIBE DETAIL my_table

This gives you the storage location. Then you can read a sample file's Parquet metadata in a notebook to verify the codec:

import pyarrow.parquet as pq

# Inspect the footer of a single data file; each column chunk records
# the codec it was written with.
meta = pq.read_metadata("/path/to/part-00000.snappy.parquet")
print(meta.row_group(0).column(0).compression)  # e.g. SNAPPY or ZSTD

Note that the file extension (e.g., .snappy.parquet or .zstd.parquet) also indicates the codec used.
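If you only need a quick check, that naming convention can be parsed directly. This is a heuristic sketch: the footer metadata, not the file name, is authoritative, and the `part-<n>-<uuid>.<codec>.parquet` pattern is an assumption about Spark's default writer naming:

```python
def codec_from_filename(name):
    """Heuristic: Spark's default writer embeds the codec in the file name,
    e.g. part-00000-<uuid>.zstd.parquet. Returns None when no codec segment
    is present. Inspect the Parquet footer for a definitive answer."""
    parts = name.split(".")
    if len(parts) >= 3 and parts[-1] == "parquet":
        return parts[-2]
    return None

print(codec_from_filename("part-00000.zstd.parquet"))  # zstd
```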

SUMMARY

- Your table property ensures zstd is used for all future writes, even from DBR 15.4 clusters.
- The only exception is if a write explicitly overrides compression via .write.option("compression", ...).
- Reading files with mixed codecs is fully supported.
- OPTIMIZE FULL rewrites all existing files to use the configured codec.

For more details, see the table properties reference:
https://docs.databricks.com/aws/en/delta/table-properties

* This reply was drafted with an agent system I built, which researches responses based on the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues and to monitor system reliability, and I update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.