How to Optimize Delta Table Performance in Databricks?

gardenmap — Sun, 04 May 2025 06:43:12 GMT

I'm working with large Delta tables in Databricks and noticing slower performance during read operations. I've already enabled Z-ordering and auto-optimize, but it still feels sluggish at scale. Are there best practices or settings I should adjust for better query performance? Also, is there a way to monitor the impact of each optimization?

Re: How to Optimize Delta Table Performance in Databricks?

BrickByBrick — Sun, 04 May 2025 17:00:55 GMT

Last week I attended a Dev Connect event in London and came across a newer optimization technique called Liquid Clustering (next-gen clustering).
Here are the key benefits of Liquid Clustering over Z-ordering. I'd recommend taking a deep dive into it.

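-Less manual OPTIMIZE scheduling: clustering is incremental, and on Unity Catalog managed tables with predictive optimization enabled it runs automatically, which reduces job scheduling and compute cost.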
-Automatically adapts to changing data and query patterns.
-Reduces data skew more effectively than static partitioning + ZORDER.
-Better performance for large-scale, frequently updated tables.
-Simplifies pipeline management — no need to manage clustering logic separately.

Liquid Clustering functionality and automatic clustering improvements are most robust in:
-Databricks Runtime 14.0+ (Databricks recommends 15.2 and above for tables with liquid clustering)
-Unity Catalog-enabled tables
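-Delta tables (enabling liquid clustering upgrades the table protocol to writer version 7 and reader version 3, so older Delta clients may no longer be able to read the table)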
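
For anyone who wants to try this, here is a minimal sketch of switching an existing Delta table to liquid clustering. The table name `main.sales.events` and the columns `event_date` and `customer_id` are made up for illustration. The helper only builds the SQL strings, so it runs anywhere; the `spark.sql` call is left commented out for use in a Databricks notebook. Note that liquid clustering cannot be combined with partitioning or ZORDER on the same table, and it accepts at most four clustering columns.

```python
def liquid_clustering_statements(table, cluster_cols):
    """Build the SQL to enable liquid clustering on an existing Delta table."""
    if not cluster_cols:
        raise ValueError("at least one clustering column is required")
    if len(cluster_cols) > 4:
        raise ValueError("liquid clustering supports at most four clustering columns")
    cols = ", ".join(cluster_cols)
    return [
        # Set (or change) the clustering keys; existing files are not rewritten here.
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        # OPTIMIZE then clusters data incrementally; no ZORDER BY clause is used.
        f"OPTIMIZE {table}",
    ]


# Hypothetical table and column names, for illustration only.
statements = liquid_clustering_statements(
    "main.sales.events", ["event_date", "customer_id"]
)
for stmt in statements:
    print(stmt)
    # spark.sql(stmt)  # uncomment in a Databricks notebook where `spark` exists
```

If predictive optimization is enabled for the table, the explicit `OPTIMIZE` step may not need to be scheduled at all; otherwise it still has to be run periodically.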

Cheers

Re: How to Optimize Delta Table Performance in Databricks?

igorborba — Sun, 04 May 2025 17:45:49 GMT

Hi @gardenmap, could you share more details if possible?

For example, here is what I've done in my case:

For tables above 1 TB that can be segregated by date, we partition by the date column;
Whether or not a table is partitioned, we run a sequence of OPTIMIZE and VACUUM, targeting only the specific columns we need rather than all of the first 32 columns (32 being the default number of leading columns on which Delta collects data-skipping statistics);
Because many of our pipelines run MERGE INTO every 5, 10, or 60 minutes, we need auto optimize enabled, but we still run OPTIMIZE with VACUUM at least once a week.
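
Roughly, the weekly maintenance described above could be sketched like this. The table name, the Z-order columns, and the partition filter are hypothetical, and the helpers only build SQL strings so the snippet runs outside Databricks. The 168-hour floor in the sketch mirrors the default 7-day VACUUM retention threshold; it is a design choice of this example, not something Databricks forces on the caller.

```python
def auto_optimize_properties(table):
    """SQL that enables optimized writes and auto compaction on a Delta table."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        "'delta.autoOptimize.optimizeWrite' = 'true', "
        "'delta.autoOptimize.autoCompact' = 'true')"
    )


def weekly_maintenance(table, zorder_cols, partition_filter=None, retain_hours=168):
    """Build an OPTIMIZE statement followed by a VACUUM statement."""
    if retain_hours < 168:
        # Shorter retention can break time travel and long-running readers.
        raise ValueError("retain_hours below 168 (7 days) is refused in this sketch")
    optimize = f"OPTIMIZE {table}"
    if partition_filter:
        # The OPTIMIZE WHERE clause may only reference partition columns.
        optimize += f" WHERE {partition_filter}"
    if zorder_cols:
        # Z-order only the columns used in filters, not every column.
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    return [optimize, f"VACUUM {table} RETAIN {retain_hours} HOURS"]


# Hypothetical names, for illustration only.
plan = weekly_maintenance(
    "main.sales.events",
    zorder_cols=["customer_id", "device_id"],
    partition_filter="event_date >= current_date() - INTERVAL 7 DAYS",
)
for stmt in [auto_optimize_properties("main.sales.events")] + plan:
    print(stmt)
    # spark.sql(stmt)  # uncomment in a Databricks notebook where `spark` exists
```

Z-order columns should be among the columns that have statistics collected, otherwise data skipping cannot use them.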

Questions:

When working with your tables, are you using the Spark SQL API or Databricks SQL?
Are you using Databricks SQL endpoints?
What type and size of cluster are you using, and is it a job cluster or an all-purpose cluster? Do the machines have SSDs?
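
On the original question about monitoring the impact of each optimization: one option is to look at the table history, where each OPTIMIZE run records operation metrics such as `numAddedFiles` and `numRemovedFiles`. Below is a small sketch that summarizes those entries. The sample rows are invented for illustration; in a notebook they would come from `DESCRIBE HISTORY`, as shown in the comment, and the metric values arrive as strings.

```python
def summarize_optimize_runs(history_rows):
    """Return one summary dict per OPTIMIZE entry in a DESCRIBE HISTORY result."""
    summaries = []
    for row in history_rows:
        if row.get("operation") != "OPTIMIZE":
            continue
        metrics = row.get("operationMetrics") or {}
        # Metric values are stored as strings in the history map.
        added = int(metrics.get("numAddedFiles", 0))
        removed = int(metrics.get("numRemovedFiles", 0))
        summaries.append(
            {
                "version": row.get("version"),
                "files_removed": removed,
                "files_added": added,
                "net_file_reduction": removed - added,
            }
        )
    return summaries


# In a Databricks notebook the rows could be loaded like this (hypothetical table):
# history_rows = [r.asDict() for r in
#                 spark.sql("DESCRIBE HISTORY main.sales.events").collect()]

# Invented sample data, shaped like DESCRIBE HISTORY output.
sample_history = [
    {"version": 12, "operation": "OPTIMIZE",
     "operationMetrics": {"numAddedFiles": "4", "numRemovedFiles": "120"}},
    {"version": 11, "operation": "MERGE",
     "operationMetrics": {"numTargetRowsUpdated": "50"}},
]
optimize_summary = summarize_optimize_runs(sample_history)
print(optimize_summary)
```

Comparing the file counts before and after each run, together with query durations for the same workload, gives a rough picture of whether a given optimization is paying off.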