Delta live type 2 scd Liquid clustering on Start a...

Kguy · ‎10-30-2024

I've created a DLT pipeline that creates type 2 SCDs and often the __Start_at and __end_at columns are beyond the first 32 columns for stat collection.

I'd like to add these columns to liquid clustering without increasing the number of columns in the stats, and in a way that is part of the table generation code (aka without having a separate ANALYZE TABLE block, unless this can be done within the DLT code itself)

Code below defines a view to filter the bronze table and uses apply changes into to push into the silver table SCD

    .view(
        name = bronze_view_name
    )
    def bronze_table_view():
        return spark.readStream.format("delta").table(f"live.{bronze_table_name}").where(silver_transformation_where_clause).selectExpr(silver_transformation_select)
    dlt.create_streaming_table(
        name = silver_table_name,
        comment = silver_table_comment,
        cluster_by = cluster_keys_silver #array of columns to include in liquid clustering
    )
    
    dlt.apply_changes(
        target = silver_table_name,
        source = bronze_view_name,
        keys = silver_cdc_keys, #array of key columns
        sequence_by = silver_cdc_sequence_by, #array of columns
        stored_as_scd_type = silver_cdc_scdtype #2
    )

The DLT pipeline uses metadata to generate tables, which makes hard coding the solution an issue

Delta live type 2 scd Liquid clustering on Start and end dates