topic Re: Cannot apply liquid clustering via DLT pipeline in Data Engineering

Cannot apply liquid clustering via DLT pipeline

TamD — Sun, 11 May 2025 23:53:33 GMT

I want to use liquid clustering on a materialised view created via a DLT pipeline, however, there doesn't appear to be a valid way to do this.

Via table properties:

@Dlt.table( name="<table name>, comment="<table description", table_properties={ "delta.clusterBy": "AUTO", ... } )

The above code produces the error:

Unknown configuration was specified: delta.clusterBy

DELTA_UNKNOWN_CONFIGURATIONUnknown configuration was specified: delta.clusterBy\

Suggestion from Genie:

@Dlt.table( name="<table_name>", comment="<table description>", table_properties={ "delta.liquidClustering.enabled": "true" ... } )

Produces the same error:

Unknown configuration was specified: delta.liquidClustering.enabled

DELTA_UNKNOWN_CONFIGURATION Unknown configuration was specified: delta.liquidClustering.enabled

Further suggestion from Genie is to use a CLUSTER BY clause:

# Enable liquid clustering spark.sql("ALTER TABLE network_banded_usage CLUSTER BY AUTO")

This produces the error:

'${command}' is not supported in spark.sql("...") API in DLT Python. Supported command: ${supportedCommands}.
UNSUPPORTED_SPARK_SQL_COMMAND'${command}' is not supported in spark.sql("...") API in DLT Python. Supported command: ${supportedCommands}.

I think this is a bug. Has anyone got liquid clustering enabled via DLT?

Re: Cannot apply liquid clustering via DLT pipeline

Louis_Frolio — Mon, 12 May 2025 18:33:10 GMT

Based on the provided context, as of now, liquid clustering cannot be enabled directly on materialized views created via a Delta Live Tables (DLT) pipeline. Attempts to set table properties such as "delta.clusterBy" or "delta.liquidClustering.enabled" produce errors because these configurations are not supported. Moreover, using a CLUSTER BY command like ALTER TABLE network_banded_usage CLUSTER BY AUTO through the spark.sql() API also fails in the DLT pipeline context due to unsupported SQL commands in Python-based DLT pipelines.

Currently, there is no support for enabling liquid clustering directly on materialized views with syntax in DLT pipelines or associated commands. However, liquid clustering is supported for Delta Lake tables managed through DLT Preview and Current channels, with clustering occurring during DLT maintenance jobs or manually via OPTIMIZE commands. Notably, the actual clustering does not happen on write but is implemented during maintenance operations like OPTIMIZE.

Hope this helps, Lou.

Re: Cannot apply liquid clustering via DLT pipeline

TamD — Tue, 13 May 2025 23:17:54 GMT

Thanks, @Louis_Frolio

My understanding is that DLT only allows for materialized views and streaming tables. When you say, "liquid clustering is supported for Delta Lake tables managed through DLT Preview and Current channels", do you mean that liquid clustering is only supported for DLT streaming tables?

Our use case requires a MERGE, which is why I was attempting to use a mat view. Streaming tables are APPEND only, and so are not suitable for this. This sounds like if we want to take advantage of liquid clustering (or, any kind of clustering?) for a table which will be receiving updates, we can't use DLT. Can you confirm?

I note that OPTIMIZE is meant to be taken care of by Predictive Optimization, now on by default.

Do you know if there are plans to allow for liquid clustering on DLT mat views in a future release?

Re: Cannot apply liquid clustering via DLT pipeline

Louis_Frolio — Wed, 14 May 2025 13:39:26 GMT

Databricks Delta Live Tables (DLT) supports liquid clustering for both streaming tables and materialized views (MVs), not just streaming tables. This means liquid clustering is available for tables managed through DLT in both Preview and Current channels, including materialized views created via DLT pipelines

As for the merge statement.

Merge does work with Delta Live Tables (DLT) pipelines but not as a direct statement; instead, the functionality is provided via the APPLY CHANGES INTO operation. This operation serves as the equivalent of the MERGE INTO command for Delta Lake tables, enabling users to process updates, inserts, and deletes from source tables.

Key details about APPLY CHANGES INTO in DLT pipelines:

1. Supported Operations: It applies INSERT and UPDATE events from the source dataset by matching primary keys and event sequencing to maintain data consistency. DELETE operations can also be handled using statements like APPLY AS DELETE WHEN in SQL, or its Python equivalent.

2. Pipeline Compatibility: The target table for APPLY CHANGES INTO must be a live table and cannot be a streaming live table.

3. Configuration Requirement: The operation needs to be explicitly enabled in the pipeline settings by adding and enabling the applyChanges configuration.

This approach bypasses limitations of simple streaming table operations, which are restricted to append-only queries, making it suitable for handling use cases requiring incremental updates to tables.

Hope this helps, Lou.

Re: Cannot apply liquid clustering via DLT pipeline

RiyazAliM — Thu, 15 May 2025 05:26:32 GMT

Hey @TamD
I was able to enable Liquid Clustering via DLT using the below Syntax.
Try it and let me know if you face any issues:

import dlt @dlt.table( comment="DLT TABLE WITH LC ENABLED", cluster_by = ["column1","more_columns"] ) def name_of_the_table(): df=logic_to_create_the_table return df

Re: Cannot apply liquid clustering via DLT pipeline

TamD — Mon, 19 May 2025 21:46:23 GMT

Thanks @RiyazAliM . I want to use cluster by auto, because the data will get queried and aggregated several different ways by different business users. I did try your code above anyway, specifying the columns to cluster by. The pipeline ran without error, but SHOW TBLPROPERTIES does not show that any clustering has been applied. These are the only properties set on the table:

delta.autoOptimize.autoCompact
delta.autoOptimize.optimizeWrite
delta.enableChangeDataFeed
delta.minReaderVersion
delta.minWriterVersion
pipelines.pipelineId

@Louis_Frolio- if automatic liquid clustering is applied to DLT table during DLT maintenance jobs -- which I believe are managed automatically by Databricks -- when should I expect to see clustering information in the table properties?

Cheers!

Re: Cannot apply liquid clustering via DLT pipeline

Mardi_Lo — Sat, 31 May 2025 08:14:13 GMT

@TamD,

DLT doesn't currently support automatic liquid clustering. I've tried adding clusterByAuto='true' to the table properties for my DLT pipelines, and the pipeline builds successfully.

However, I don't think it actually works. I feel it's just treated as a customized tag in the table properties, as I have a 300GB streaming DLT table with this setting, and there are no clustering keys chosen when I run DESCRIBE TABLE EXTENDED.

Re: Cannot apply liquid clustering via DLT pipeline

Anand13 — Mon, 02 Jun 2025 11:49:53 GMT

Hi everyone, in our project we are trying to implement liquid clustering. We are testing liquid clustering with a test table called status_update, where we need to update the status for different market IDs. We are trying to update the status_update table in parallel using the update command. spark.sql(f"update status_update set status='{status}' where mkt_id ={mkt_id}") When we run the notebook in parallel for different market IDs, we encounter a concurrency issue.

Re: Cannot apply liquid clustering via DLT pipeline

Anand13 — Mon, 02 Jun 2025 11:53:47 GMT

Hi BigRoux

In our project we are trying to implement liquid clustering. We are testing liquid clustering with a test table called status_update, where we need to update the status for different market IDs. We are trying to update the status_update table in parallel using the update command. spark.sql(f"update status_update set status='{status}' where mkt_id ={mkt_id}") When we run the notebook in parallel for different market IDs, we encounter a concurrency issue.