Hi @sebih,
Liquid clustering is fully supported on materialized views in Lakeflow Spark Declarative Pipelines (SDP), so the configuration you have should work. There are a few common reasons why clustering can appear to be missing from the table metadata, so let's check those first.
USING BOTH cluster_by AND cluster_by_auto
Your decorator specifies both cluster_by=["date"] and cluster_by_auto=True at the same time. While these can be combined (the manual columns act as initial hints before automatic selection takes over), it is worth testing with just one to rule out any interaction issue:
from pyspark import pipelines as dp

@dp.materialized_view(
    name="final_table",
    cluster_by=["date"],
    table_properties={
        "delta.autoOptimize.autoCompact": "auto",
        "delta.autoOptimize.optimizeWrite": "true"
    }
)
def final_table():
    return (
        spark.read.table("my_table_1")
        .unionByName(spark.read.table("my_table_2").drop("id"), allowMissingColumns=True)
    )
Or, if you want fully automatic key selection:
@dp.materialized_view(
    name="final_table",
    cluster_by_auto=True,
    table_properties={
        "delta.autoOptimize.autoCompact": "auto",
        "delta.autoOptimize.optimizeWrite": "true"
    }
)
def final_table():
    ...
FULL REFRESH MAY BE REQUIRED
If the materialized view was originally created without clustering and you later added cluster_by to the decorator, the existing table may not pick up the clustering configuration until a full refresh is run. A full refresh drops and recreates the table with the new definition. You can trigger one from the pipeline UI by clicking the dropdown next to "Refresh selection" and choosing "Full Refresh selection."
VERIFYING CLUSTERING METADATA
After the pipeline update completes, run the following to confirm clustering is in place:
DESCRIBE DETAIL catalog.schema.final_table
Look for the clusteringColumns field in the output. You can also check:
SHOW TBLPROPERTIES catalog.schema.final_table
If cluster_by_auto is enabled, you should see clusterByAuto set to true.
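If you run this check regularly, it can be scripted. This is a minimal sketch, assuming an active `spark` session in a notebook; the table name is a placeholder for your own:

```python
def clustering_columns(detail_row: dict) -> list:
    """Extract liquid clustering columns from a DESCRIBE DETAIL row.

    `detail_row` is the first row of DESCRIBE DETAIL converted to a dict.
    Returns an empty list when no clustering is configured on the table.
    """
    return list(detail_row.get("clusteringColumns") or [])


# In a notebook (table name is a placeholder):
# detail = spark.sql("DESCRIBE DETAIL catalog.schema.final_table").first().asDict()
# clustering_columns(detail)  # e.g. ['date'] when clustering is active, [] when not
```

An empty list after a full refresh would confirm the clustering configuration is genuinely not being applied, rather than just not showing where you looked.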
PIPELINE RUNTIME VERSION
Liquid clustering on materialized views and streaming tables requires your pipeline to be running on a runtime equivalent to Databricks Runtime 15.2 or higher. If your pipeline is on an older runtime, upgrade it in the pipeline settings. For the best performance, Databricks Runtime 16.4 LTS or newer is recommended.
PERFORMANCE COMPARISON WITH STREAMING TABLES
Materialized views are recomputed from scratch on each refresh (unless using an incremental refresh policy), so the data layout after a refresh may differ from a streaming table that applies clustering on write continuously. After a full refresh with clustering enabled, subsequent queries should benefit from the clustered layout. If performance still lags behind the streaming table equivalent, you can also run OPTIMIZE on the materialized view table outside the pipeline to trigger clustering compaction.
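As a sketch of that last step (assuming an active `spark` session and a placeholder table name): for a liquid-clustered table you run OPTIMIZE with no ZORDER BY clause, since liquid clustering and Z-ordering are mutually exclusive.

```python
def optimize_statement(table_name: str) -> str:
    # For a liquid-clustered table, OPTIMIZE without ZORDER BY
    # incrementally clusters recently written data.
    return f"OPTIMIZE {table_name}"


# In a notebook, outside the pipeline (table name is a placeholder):
# spark.sql(optimize_statement("catalog.schema.final_table"))
```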
DOCUMENTATION REFERENCES
- Liquid clustering overview: https://docs.databricks.com/en/delta/clustering.html
- CREATE MATERIALIZED VIEW with CLUSTER BY: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-ddl-create-materialized-view.html
- Python materialized_view decorator reference: https://docs.databricks.com/aws/en/ldp/developer/ldp-python-ref-materialized-view
* This reply was drafted with an agent system I built, which researches responses using the wide set of documentation I have available and previous memory. I personally review each draft for obvious issues, monitor the system's reliability, and update replies when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand-new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.