Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Cannot apply liquid clustering via DLT pipeline

TamD
Contributor

I want to use liquid clustering on a materialised view created via a DLT pipeline; however, there doesn't appear to be a valid way to do this.

Via table properties:

@dlt.table(
    name="<table_name>",
    comment="<table description>",
    table_properties={
        "delta.clusterBy": "AUTO",
        ...
    }
)

The above code produces the error:

DELTA_UNKNOWN_CONFIGURATION: Unknown configuration was specified: delta.clusterBy

Suggestion from Genie:

@dlt.table(
    name="<table_name>",
    comment="<table description>",
    table_properties={
        "delta.liquidClustering.enabled": "true",
        ...
    }
)
This produces the same error:

DELTA_UNKNOWN_CONFIGURATION: Unknown configuration was specified: delta.liquidClustering.enabled

 
Further suggestion from Genie is to use a CLUSTER BY clause:
# Enable liquid clustering
spark.sql("ALTER TABLE network_banded_usage CLUSTER BY AUTO")

This produces the error:

UNSUPPORTED_SPARK_SQL_COMMAND: '${command}' is not supported in spark.sql("...") API in DLT Python. Supported command: ${supportedCommands}.

I think this is a bug. Has anyone managed to enable liquid clustering via DLT?

8 REPLIES

BigRoux
Databricks Employee
Based on the provided context, as of now, liquid clustering cannot be enabled directly on materialized views created via a Delta Live Tables (DLT) pipeline. Attempts to set table properties such as "delta.clusterBy" or "delta.liquidClustering.enabled" produce errors because these configurations are not supported. Moreover, using a CLUSTER BY command like ALTER TABLE network_banded_usage CLUSTER BY AUTO through the spark.sql() API also fails in the DLT pipeline context due to unsupported SQL commands in Python-based DLT pipelines.
 
Currently, there is no syntax in DLT pipelines, or associated commands, for enabling liquid clustering directly on materialized views. However, liquid clustering is supported for Delta Lake tables managed through the DLT Preview and Current channels, with clustering applied during DLT maintenance jobs or manually via OPTIMIZE commands. Notably, the actual clustering does not happen on write; it is applied during maintenance operations such as OPTIMIZE.
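For a regular Delta table you own (outside the DLT pipeline context, where ALTER/OPTIMIZE are not accepted by spark.sql()), triggering clustering manually might look like the sketch below; the table name is a placeholder and the spark session is injected so the helper can be exercised without a cluster. Whether a DLT-managed table accepts an external ALTER is a separate question.

```python
def trigger_clustering(spark, table: str) -> list:
    """Enable AUTO liquid clustering and rewrite files via OPTIMIZE.

    Run from a regular notebook or SQL warehouse, not inside a DLT
    pipeline (where ALTER/OPTIMIZE are unsupported in spark.sql()).
    Returns the executed statements, mainly for logging/testing.
    """
    statements = [
        f"ALTER TABLE {table} CLUSTER BY AUTO",
        f"OPTIMIZE {table}",
    ]
    for stmt in statements:
        spark.sql(stmt)
    return statements
```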
 
Hope this helps, Lou.

Thanks, @BigRoux

My understanding is that DLT only allows for materialized views and streaming tables. When you say, "liquid clustering is supported for Delta Lake tables managed through DLT Preview and Current channels", do you mean that liquid clustering is only supported for DLT streaming tables?

Our use case requires a MERGE, which is why I was attempting to use a materialized view. Streaming tables are append-only, and so are not suitable for this. It sounds as though, if we want to take advantage of liquid clustering (or any kind of clustering?) for a table that will be receiving updates, we can't use DLT. Can you confirm?

I note that OPTIMIZE is meant to be taken care of by Predictive Optimization, now on by default.

Do you know if there are plans to allow for liquid clustering on DLT mat views in a future release?

Anand13
New Contributor II

Hi BigRoux,

In our project we are trying to implement liquid clustering. We are testing it with a table called status_update, where we need to update the status for different market IDs:

spark.sql(f"UPDATE status_update SET status = '{status}' WHERE mkt_id = {mkt_id}")

When we run the notebook in parallel for different market IDs, we encounter a concurrency issue.
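Concurrent UPDATEs that touch the same files typically fail with a Delta concurrency conflict (e.g. ConcurrentAppendException). A common mitigation, independent of clustering, is to retry with backoff. A minimal sketch, where run_sql stands in for spark.sql (injected so the logic can be tested without Spark) and the table/column names follow the example above:

```python
import random
import time

def update_status_with_retry(run_sql, mkt_id, status, max_retries=5):
    """Retry an UPDATE when a Delta concurrency conflict is raised.

    `run_sql` is a callable (e.g. spark.sql), passed in so this
    helper can be exercised without a live Spark session.
    """
    stmt = (
        f"UPDATE status_update SET status = '{status}' "
        f"WHERE mkt_id = {int(mkt_id)}"
    )
    for attempt in range(max_retries):
        try:
            return run_sql(stmt)
        except Exception as exc:
            # Only retry on Delta concurrency conflicts; re-raise anything else.
            if "Concurrent" not in type(exc).__name__:
                raise
            # Exponential backoff with jitter before the next attempt.
            time.sleep((2 ** attempt) * 0.05 + random.random() * 0.05)
    raise RuntimeError(
        f"UPDATE for mkt_id={mkt_id} failed after {max_retries} retries"
    )
```

Designing the writes so different market IDs land in different files (e.g. via partitioning or clustering on mkt_id) reduces how often the retry path is hit in the first place.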

BigRoux
Databricks Employee

Databricks Delta Live Tables (DLT) supports liquid clustering for both streaming tables and materialized views (MVs), not just streaming tables. This means liquid clustering is available for tables managed through DLT in both the Preview and Current channels, including materialized views created via DLT pipelines.

As for the MERGE statement:

MERGE does work with Delta Live Tables (DLT) pipelines, but not as a direct statement; instead, the functionality is provided via the APPLY CHANGES INTO operation. This operation is the equivalent of the MERGE INTO command for Delta Lake tables, enabling users to process updates, inserts, and deletes from source tables.
 
Key details about APPLY CHANGES INTO in DLT pipelines:
1. Supported Operations: It applies INSERT and UPDATE events from the source dataset by matching primary keys and event sequencing to maintain data consistency. DELETE operations can also be handled using statements like APPLY AS DELETE WHEN in SQL, or its Python equivalent.
2. Pipeline Compatibility: The target table for APPLY CHANGES INTO must be a live table and cannot be a streaming live table.
3. Configuration Requirement: The operation needs to be explicitly enabled in the pipeline settings by adding and enabling the applyChanges configuration.
 
This approach bypasses limitations of simple streaming table operations, which are restricted to append-only queries, making it suitable for handling use cases requiring incremental updates to tables.
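In Python, the equivalent of APPLY CHANGES INTO is dlt.apply_changes(). A minimal sketch of a pipeline-definition fragment; the source/target names, key, and sequencing column are all hypothetical, and this only runs inside a DLT pipeline, not a plain notebook:

```python
import dlt
from pyspark.sql.functions import col

# The target must be declared first; it is created and managed by the pipeline.
dlt.create_streaming_table("status_target")

dlt.apply_changes(
    target="status_target",          # table receiving the upserts
    source="status_updates_source",  # CDC feed defined elsewhere in the pipeline
    keys=["mkt_id"],                 # key used to match rows for update/insert
    sequence_by=col("event_ts"),     # ordering column to resolve out-of-order events
)
```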
 
Hope this helps, Lou.

aayrm5
Honored Contributor

Hey @TamD 
I was able to enable liquid clustering via DLT using the syntax below.
Try it and let me know if you face any issues:

import dlt

@dlt.table(
    comment="DLT TABLE WITH LC ENABLED",
    cluster_by=["column1", "more_columns"]
)
def name_of_the_table():
    df = logic_to_create_the_table  # placeholder for the table's transformation logic
    return df
Riz

TamD
Contributor

Thanks @aayrm5. I want to use CLUSTER BY AUTO, because the data will get queried and aggregated in several different ways by different business users. I did try your code above anyway, specifying the columns to cluster by. The pipeline ran without error, but SHOW TBLPROPERTIES does not show that any clustering has been applied. These are the only properties set on the table:

delta.autoOptimize.autoCompact
delta.autoOptimize.optimizeWrite
delta.enableChangeDataFeed
delta.minReaderVersion
delta.minWriterVersion
pipelines.pipelineId

 

@BigRoux - if automatic liquid clustering is applied to a DLT table during DLT maintenance jobs (which I believe are managed automatically by Databricks), when should I expect to see clustering information in the table properties?
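For what it's worth, clustering keys are reported in the DESCRIBE DETAIL output (the clusteringColumns field) rather than in SHOW TBLPROPERTIES, so that may be the better place to check. A small sketch; the table name is hypothetical, and the helper takes the row as a plain dict so it can be tested without Spark:

```python
def clustering_keys(detail_row: dict) -> list:
    """Return the liquid clustering columns from a DESCRIBE DETAIL row.

    Usage against a live table (hypothetical name):
        row = spark.sql("DESCRIBE DETAIL my_table").first().asDict()
        clustering_keys(row)

    An empty list means no clustering keys are currently set.
    """
    return list(detail_row.get("clusteringColumns") or [])
```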

Cheers!

Mardi_Lo
New Contributor II

@TamD,

DLT doesn't currently support automatic liquid clustering. I've tried adding clusterByAuto='true' to the table properties for my DLT pipelines, and the pipeline builds successfully.

However, I don't think it actually works. It seems to be treated as just a custom tag in the table properties: I have a 300 GB streaming DLT table with this setting, and no clustering keys are shown when I run DESCRIBE TABLE EXTENDED.

