topic Re: Unable to apply liquid clustering to a materialized view in Data Engineering

Unable to apply liquid clustering to a materialized view

sebih — Mon, 19 Jan 2026 11:15:03 GMT

Hi everyone,

I am trying to create a materialized view with liquid clustering using the code below. However, I noticed that queries against it are slower than against a streaming table with the same data, structure, and liquid clustering. When I check the materialized view's metadata, liquid clustering does not appear to be present (see the related screenshot). When I created the table as a streaming table, I could see that liquid clustering was applied successfully.

Thanks in advance.

@DP.materialized_view(
  name="final_table",
  cluster_by=["date"],
  cluster_by_auto=True,
  table_properties={
      "delta.autoOptimize.autoCompact": "auto",
      "delta.autoOptimize.optimizeWrite": "true"
  }
)
def final_table():
  return (
      spark.read.table("my_table_1")
      .unionByName(spark.read.table("my_table_2").drop("id"), allowMissingColumns=True)
  )

Re: Unable to apply liquid clustering to a materialized view

szymon_dybczak — Mon, 19 Jan 2026 16:43:12 GMT

Hi @sebih ,

Automatic liquid clustering might not select keys for the following reason:

- The table is too small to benefit from liquid clustering.

Note that you can enable automatic liquid clustering on any Unity Catalog managed table, regardless of data and query characteristics. The heuristics then decide whether it is cost-beneficial to select clustering keys.

https://docs.databricks.com/aws/en/delta/clustering#how-automatic-liquid-clustering-works


SteveOstrowski — Sun, 08 Mar 2026 20:59:40 GMT

Hi @sebih,

Liquid clustering is fully supported on materialized views in Lakeflow Spark Declarative Pipelines (SDP), so the configuration you have should work. There are a couple of things to check that commonly cause clustering to appear missing from the metadata.

USING BOTH cluster_by AND cluster_by_auto

Your decorator specifies both cluster_by=["date"] and cluster_by_auto=True at the same time. While these can be combined (the manual columns act as initial hints before automatic selection takes over), it is worth testing with just one to rule out any interaction issue:

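# Assumed import for a Lakeflow SDP pipeline source file (not shown in the original post).
from pyspark import pipelines as dp

@dp.materialized_view(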
  name="final_table",
  cluster_by=["date"],
  table_properties={
      "delta.autoOptimize.autoCompact": "auto",
      "delta.autoOptimize.optimizeWrite": "true"
  }
)
def final_table():
  return (
      spark.read.table("my_table_1")
      .unionByName(spark.read.table("my_table_2").drop("id"), allowMissingColumns=True)
  )

Or, if you want fully automatic key selection:

@dp.materialized_view(
  name="final_table",
  cluster_by_auto=True,
  table_properties={
      "delta.autoOptimize.autoCompact": "auto",
      "delta.autoOptimize.optimizeWrite": "true"
  }
)
def final_table():
  ...

FULL REFRESH MAY BE REQUIRED

If the materialized view was originally created without clustering and you later added cluster_by to the decorator, the existing table may not pick up the clustering configuration until a full refresh is run. A full refresh drops and recreates the table with the new definition. You can trigger one from the pipeline UI by clicking the dropdown next to "Refresh selection" and choosing "Full Refresh selection."
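If you would rather trigger the full refresh from code than from the UI, here is a minimal sketch. It assumes the Pipelines REST API endpoint for starting an update (POST /api/2.0/pipelines/{pipeline_id}/updates) and its full_refresh_selection field; the helper names, host, pipeline ID, and token are all placeholders of mine, and the exact form of the table name should be checked against the API documentation. The request is only built here, not sent.

```python
import json
import urllib.request


def build_full_refresh_payload(tables):
    """Return a request body asking for a full refresh of only the given tables."""
    if not tables:
        raise ValueError("at least one table name is required")
    return {"full_refresh_selection": list(tables)}


def build_update_request(host, pipeline_id, token, tables):
    """Build (but do not send) the POST request that starts a pipeline update."""
    body = json.dumps(build_full_refresh_payload(tables)).encode("utf-8")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/pipelines/{pipeline_id}/updates",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values; substitute your own workspace URL, pipeline ID, and token.
request = build_update_request(
    host="https://example.cloud.databricks.com",
    pipeline_id="my-pipeline-id",
    token="my-token",
    tables=["final_table"],
)
# Passing `request` to urllib.request.urlopen would submit the update.
```

Keeping the payload builder separate from the request builder makes it easy to inspect exactly which tables will be fully refreshed before anything is submitted.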

VERIFYING CLUSTERING METADATA

After the pipeline update completes, run the following to confirm clustering is in place:

DESCRIBE DETAIL catalog.schema.final_table

Look for the clusteringColumns field in the output. You can also check:

SHOW TBLPROPERTIES catalog.schema.final_table

If cluster_by_auto is enabled, you should see clusterByAuto set to true.
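The two checks can be folded into one small helper. This is only a sketch: it assumes the DESCRIBE DETAIL row exposes a clusteringColumns field and that SHOW TBLPROPERTIES reports clusterByAuto, as described above. The function name and the sample values are hypothetical.

```python
def summarize_clustering(detail, properties):
    """Summarize clustering state from DESCRIBE DETAIL and SHOW TBLPROPERTIES output.

    detail:     mapping of column name -> value for the single DESCRIBE DETAIL row
    properties: mapping of table property key -> value
    """
    columns = list(detail.get("clusteringColumns") or [])
    auto = str(properties.get("clusterByAuto", "false")).lower() == "true"
    return {
        "clustering_columns": columns,
        "cluster_by_auto": auto,
        "clustered": bool(columns),
    }


# Hypothetical values standing in for real query results.
sample = summarize_clustering(
    detail={"name": "catalog.schema.final_table", "clusteringColumns": ["date"]},
    properties={"clusterByAuto": "true"},
)
print(sample)
```

In a notebook, the inputs could be built with spark.sql("DESCRIBE DETAIL catalog.schema.final_table").first().asDict() and {r["key"]: r["value"] for r in spark.sql("SHOW TBLPROPERTIES catalog.schema.final_table").collect()}.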

PIPELINE RUNTIME VERSION

Liquid clustering on materialized views and streaming tables requires your pipeline to be running on a runtime equivalent to Databricks Runtime 15.2 or higher. If your pipeline is on an older runtime, upgrade it in the pipeline settings. For the best performance, Databricks Runtime 16.4 LTS or newer is recommended.

PERFORMANCE COMPARISON WITH STREAMING TABLES

Materialized views are recomputed from scratch on each refresh (unless using an incremental refresh policy), so the data layout after a refresh may differ from a streaming table that applies clustering on write continuously. After a full refresh with clustering enabled, subsequent queries should benefit from the clustered layout. If performance still lags behind the streaming table equivalent, you can also run OPTIMIZE on the materialized view table outside the pipeline to trigger clustering compaction.
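When comparing the materialized view with the streaming table, it can help to time the same query several times against each and compare medians instead of single runs. Here is a minimal sketch; the helper name is mine, and the query in the comment is a hypothetical placeholder.

```python
import statistics
import time


def median_runtime(run_query, repeats=5):
    """Call run_query `repeats` times and return the median wall-clock seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


# Stand-in workload so the sketch runs anywhere. In a notebook, pass something like:
#   lambda: spark.sql(
#       "SELECT count(*) FROM catalog.schema.final_table WHERE date = '2026-01-01'"
#   ).collect()
elapsed = median_runtime(lambda: sum(range(10_000)), repeats=3)
print(f"median: {elapsed:.6f}s")
```

Running the same filter on the clustering column against both tables gives a like-for-like number, and the median dampens the effect of a single cold or cached run.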

DOCUMENTATION REFERENCES

- Liquid clustering overview: https://docs.databricks.com/en/delta/clustering.html
- CREATE MATERIALIZED VIEW with CLUSTER BY: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-materialized-view.html
- Python materialized_view decorator reference: https://docs.databricks.com/aws/en/ldp/developer/ldp-python-ref-materialized-view

* This reply was researched and drafted by an agent system I built, drawing on the wide set of documentation I have available and on previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update it when I detect drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.

Re: Unable to apply liquid clustering to a materialized view

sebih — Mon, 09 Mar 2026 07:17:08 GMT

Hi,

- The table size is about 2 TB.

- I had already set the liquid clustering keys before creating the materialized view. There were no issues with automatic liquid clustering.

- The issue with the liquid clustering keys not appearing in the metadata has been resolved. About a day after opening this post, the keys appeared in the metadata correctly.

- The slow performance issue persisted for quite some time. However, it now appears to be resolved. The view’s performance is currently the same as the streaming table version of the same data.

I was not able to determine the root cause, but the issues seem to have been resolved.