Re: Delta live type 2 scd Liquid clustering on Sta...

agallard · ‎10-30-2024

Good question! In Databricks, Delta Lake automatically collects statistics only for the first 32 columns in a table for performance optimization. When __start_at and __end_at columns are beyond this limit, they are excluded from automatic stats collection.

However, there are ways to improve query performance on these columns without including them in the stats.

An indirect way to ensure that __start_at and __end_at are included in the statistics is to rearrange the table layout so that these columns are within the first 32 positions. This involves recreating the table with the columns reordered so that the most relevant columns appear first.

Step by Step:

Create a temporary table with the __start_at and __end_at columns in the first positions of the schema.
Copy the data from the original table to this new table.
Delete the original table and rename the new table.

# Assuming your current table is called `silver_table_name`.
new_schema = [‘__start_at’, ‘__end_at’] + [col for col in df.columns if col not in [‘__start_at’, ‘__end_at’]]]

# Reorder the DataFrame
reordered_df = df.select(new_schema)

# Write the table with the reordered schema
reordered_df.write.mode(‘overwrite’).format(‘delta’).saveAsTable(‘new_silver_table_name’)

⚠️This is a solution that may require additional work for scheme management, especially in a production environment, and may not be feasible if the scheme is handled automatically.

Although we cannot force Delta Lake to include specific columns in the automatic statistics, you can run a manual analysis to include any column you wish.

ANALYZE TABLE your_table_name COMPUTE STATISTICS FOR COLUMNS __start_at, __end_at;

You can read more in this links: https://docs.databricks.com/en/sql/language-manual/sql-ref-syntax-aux-analyze-table.html

This command can be used in combination with ZORDER or cluster_by optimisations to improve performance without relying solely on automatic statistics.

My alternatives would be this:

Reorder the columns in the schema so that __start_at and __end_at are in the first positions.
Run ANALYZE TABLE manually to calculate column-specific statistics after each major load.

I hope it will be of help to you

Regards!

Alfonso Gallardo
-------------------
 I love working with tools like Databricks, Python, Azure, Microsoft Fabric, Azure Data Factory, and other Microsoft solutions, focusing on developing scalable and efficient solutions with Apache Spark