05-11-2025 04:53 PM
I want to use liquid clustering on a materialised view created via a DLT pipeline, however, there doesn't appear to be a valid way to do this.
Via table properties:
@dlt.table(
    name="<table_name>",
    comment="<table description>",
    table_properties={
        "delta.clusterBy": "AUTO",
        ...
    }
)
The above code produces the error:
DELTA_UNKNOWN_CONFIGURATION: Unknown configuration was specified: delta.clusterBy
Suggestion from Genie:
@dlt.table(
    name="<table_name>",
    comment="<table description>",
    table_properties={
        "delta.liquidClustering.enabled": "true",
        ...
    }
)
This also fails:
DELTA_UNKNOWN_CONFIGURATION: Unknown configuration was specified: delta.liquidClustering.enabled
# Enable liquid clustering
spark.sql("ALTER TABLE network_banded_usage CLUSTER BY AUTO")
This produces the error:
UNSUPPORTED_SPARK_SQL_COMMAND: '${command}' is not supported in spark.sql("...") API in DLT Python. Supported commands: ${supportedCommands}.
I think this is a bug. Has anyone got liquid clustering enabled via DLT?
05-12-2025 11:33 AM
"delta.clusterBy"
or "delta.liquidClustering.enabled"
produce errors because these configurations are not supported. Moreover, using a CLUSTER BY
command like ALTER TABLE network_banded_usage CLUSTER BY AUTO
through the spark.sql()
API also fails in the DLT pipeline context due to unsupported SQL commands in Python-based DLT pipelines.05-13-2025 04:17 PM
Thanks, @BigRoux
My understanding is that DLT only allows for materialized views and streaming tables. When you say, "liquid clustering is supported for Delta Lake tables managed through DLT Preview and Current channels", do you mean that liquid clustering is only supported for DLT streaming tables?
Our use case requires a MERGE, which is why I was attempting to use a mat view. Streaming tables are APPEND only, and so are not suitable for this. This sounds like if we want to take advantage of liquid clustering (or, any kind of clustering?) for a table which will be receiving updates, we can't use DLT. Can you confirm?
I note that OPTIMIZE is meant to be taken care of by Predictive Optimization, now on by default.
Do you know if there are plans to allow for liquid clustering on DLT mat views in a future release?
2 weeks ago
Hi BigRoux
In our project we are trying to implement liquid clustering. We are testing it with a table called status_update, where we need to update the status for different market IDs. We are trying to update the status_update table in parallel using the UPDATE command:
spark.sql(f"update status_update set status='{status}' where mkt_id = {mkt_id}")
When we run the notebook in parallel for different market IDs, we encounter a concurrency issue.
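Parallel row-level UPDATEs against the same Delta table can conflict at commit time, and the usual workaround is to retry the failed transaction with backoff. Below is a minimal sketch of such a retry wrapper; it is an assumption-based illustration, not official Databricks guidance, and it detects conflicts by checking whether the exception class name contains "Concurrent" (Delta surfaces these as ConcurrentAppendException, ConcurrentWriteException, and similar):

```python
import random
import time

def run_with_retry(fn, max_attempts=5, base_delay=1.0):
    """Retry fn() when a Delta concurrency conflict is raised, with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception as exc:
            # Delta concurrency errors carry class names like
            # ConcurrentAppendException / ConcurrentWriteException.
            if "Concurrent" not in type(exc).__name__ or attempt == max_attempts:
                raise
            # Exponential backoff with jitter before retrying the transaction.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.random())

# Usage in the notebook (spark, status, and mkt_id are assumed to exist):
# run_with_retry(lambda: spark.sql(
#     f"UPDATE status_update SET status = '{status}' WHERE mkt_id = {mkt_id}"))
```

Narrowing each UPDATE's predicate (and clustering on mkt_id) also reduces the chance that two writers touch the same files in the first place.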
05-14-2025 06:39 AM
Databricks Delta Live Tables (DLT) supports liquid clustering for both streaming tables and materialized views (MVs), not just streaming tables. This means liquid clustering is available for tables managed through DLT in both the Preview and Current channels, including materialized views created via DLT pipelines.
As for the MERGE statement: DLT provides the APPLY CHANGES INTO operation, which serves as the equivalent of the MERGE INTO command for Delta Lake tables, enabling users to process updates, inserts, and deletes from source tables. APPLY CHANGES INTO in DLT pipelines handles INSERT and UPDATE events from the source dataset by matching primary keys and event sequencing to maintain data consistency. DELETE operations can also be handled using statements like APPLY AS DELETE WHEN in SQL, or its Python equivalent. The target of APPLY CHANGES INTO must be a live table and cannot be a streaming live table, and such targets are managed through the applyChanges configuration.
05-14-2025 10:26 PM
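For reference, the Python counterpart of APPLY CHANGES INTO is dlt.apply_changes. The sketch below only runs inside a DLT pipeline, and the source name and the mkt_id/event_ts/operation columns are placeholders borrowed from this thread's example, not tested code:

```python
import dlt
from pyspark.sql.functions import col, expr

# In the Python API, the target that APPLY CHANGES maintains is created
# as a streaming table first.
dlt.create_streaming_table("status_current")

dlt.apply_changes(
    target="status_current",          # table receiving the merged rows
    source="status_updates_stream",   # placeholder CDC source view/table
    keys=["mkt_id"],                  # key columns used to match rows
    sequence_by=col("event_ts"),      # ordering column for out-of-order events
    apply_as_deletes=expr("operation = 'DELETE'"),  # rows treated as deletes
)
```

This gives MERGE-like upsert/delete behavior in DLT without writing MERGE INTO directly.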
Hey @TamD
I was able to enable Liquid Clustering via DLT using the below Syntax.
Try it and let me know if you face any issues:
import dlt

@dlt.table(
    comment="DLT TABLE WITH LC ENABLED",
    cluster_by=["column1", "more_columns"]
)
def name_of_the_table():
    df = logic_to_create_the_table  # placeholder for the table-building logic
    return df
a month ago
Thanks, @aayrm5. I want to use CLUSTER BY AUTO, because the data will get queried and aggregated several different ways by different business users. I did try your code above anyway, specifying the columns to cluster by. The pipeline ran without error, but SHOW TBLPROPERTIES does not show that any clustering has been applied. These are the only properties set on the table:
delta.autoOptimize.autoCompact
delta.autoOptimize.optimizeWrite
delta.enableChangeDataFeed
delta.minReaderVersion
delta.minWriterVersion
pipelines.pipelineId
@BigRoux- if automatic liquid clustering is applied to DLT table during DLT maintenance jobs -- which I believe are managed automatically by Databricks -- when should I expect to see clustering information in the table properties?
Cheers!
2 weeks ago
DLT doesn't currently support automatic liquid clustering. I've tried adding clusterByAuto='true' to the table properties for my DLT pipelines, and the pipeline builds successfully.
However, I don't think it actually works; it seems to be treated as just a custom tag in the table properties. I have a 300GB streaming DLT table with this setting, and no clustering keys are chosen when I run DESCRIBE TABLE EXTENDED.
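One way to confirm whether clustering keys were actually applied is to read the clusteringColumns field that DESCRIBE DETAIL returns for Delta tables. A sketch (requires a Spark session on Databricks; the table name is a placeholder):

```python
# Placeholder table name; substitute your own catalog.schema.table.
detail = spark.sql("DESCRIBE DETAIL my_catalog.my_schema.my_dlt_table")

# DESCRIBE DETAIL on a Delta table includes a clusteringColumns field;
# an empty list means no liquid clustering keys are set on the table.
print(detail.select("clusteringColumns").first()[0])
```

If that list is empty after the pipeline runs, the cluster_by setting (or the clusterByAuto property) was not honored.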