Re: Delta live type 2 scd Liquid clustering on Sta...

agallard · ‎10-30-2024

To implement liquid clustering on __start_at and __end_at columns in a Delta Live Tables (DLT) pipeline for a type 2 Slowly Changing Dimension (SCD) without manually increasing the number of columns in stats collection configure cluster_by Columns Dynamically:

Define __start_at and __end_at in cluster_by at the table level to improve query performance. This clustering will focus the layout on these columns, enhancing the performance of time-based filtering.

import dlt
from pyspark.sql import functions as F

# Define the bronze view to filter your input data
@dlt.view
def bronze_table_view():
    return (
        spark.readStream.format("delta")
        .table(f"live.{bronze_table_name}")
        .where(silver_transformation_where_clause)
        .selectExpr(silver_transformation_select)
    )

# Create the silver table with liquid clustering on start and end dates
dlt.create_streaming_table(
    name=silver_table_name,
    comment=silver_table_comment,
    cluster_by=["__start_at", "__end_at"],  # Dynamic liquid clustering by date columns
)

# Apply type 2 SCD logic with `apply_changes` and specify the clustering
dlt.apply_changes(
    target=silver_table_name,
    source="bronze_table_view",
    keys=silver_cdc_keys,               # Primary keys for SCD
    sequence_by=silver_cdc_sequence_by,  # Sequence column for ordering
    stored_as_scd_type="2",              # Define as type 2 SCD
)

Using these methods ensures that your __start_at and __end_at columns are effectively clustered, improving query performance without a need to extend stats collection explicitly for these columns.

Let me know if you need further details on these configurations!

Regards!

Alfonso Gallardo
-------------------
 I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark