- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
10-30-2024 06:52 AM
Hi @Kguy,
To implement liquid clustering on __start_at and __end_at columns in a Delta Live Tables (DLT) pipeline for a type 2 Slowly Changing Dimension (SCD) without manually increasing the number of columns in stats collection configure cluster_by Columns Dynamically:
Define __start_at and __end_at in cluster_by at the table level to improve query performance. This clustering will focus the layout on these columns, enhancing the performance of time-based filtering.
import dlt
from pyspark.sql import functions as F
# Define the bronze view to filter your input data
@dlt.view
def bronze_table_view():
return (
spark.readStream.format("delta")
.table(f"live.{bronze_table_name}")
.where(silver_transformation_where_clause)
.selectExpr(silver_transformation_select)
)
# Create the silver table with liquid clustering on start and end dates
dlt.create_streaming_table(
name=silver_table_name,
comment=silver_table_comment,
cluster_by=["__start_at", "__end_at"], # Dynamic liquid clustering by date columns
)
# Apply type 2 SCD logic with `apply_changes` and specify the clustering
dlt.apply_changes(
target=silver_table_name,
source="bronze_table_view",
keys=silver_cdc_keys, # Primary keys for SCD
sequence_by=silver_cdc_sequence_by, # Sequence column for ordering
stored_as_scd_type="2", # Define as type 2 SCD
)
Using these methods ensures that your __start_at and __end_at columns are effectively clustered, improving query performance without a need to extend stats collection explicitly for these columns.
Let me know if you need further details on these configurations!
Regards!
-------------------
I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark