
Confirmation that Ingestion Time Clustering is applied

Oliver_Angelil
Valued Contributor II

The article on Ingestion Time Clustering mentions that "Ingestion Time Clustering is enabled by default on Databricks Runtime 11.2"; however, how can I confirm that it is active for my table?

For example, is there a:

  • A True/False "Ingestion Time Clustered" flag to confirm?
  • A new column that is created?
  • A particular way the partitions are structured?

Thanks,
Oliver

 

1 ACCEPTED SOLUTION


NandiniN
Databricks Employee

Hello @Oliver_Angelil ,
Ingestion time clustering doesn't use any field; it simply uses the time your data arrives. It relies on implicit clustering by ingestion time and doesn't store that time anywhere other than in the per-file metadata, and it does not disturb the natural order of the records.
To verify it, you'd have to look at the query profile (Spark UI / query profile), see how much data is scanned for the table, and compare that to the full table size, for queries where you would expect it to help, i.e., queries with a time-based filter.
When we say it is on by default, we mean we always use this config on unpartitioned tables. So the flag would always be "true" on DBR 11.2+, but that on its own would be deceiving, because we never know whether it will help every workload. By that I mean: if you have ZORDER, it would not. Ingestion time clustering works with auto compaction, but optimized writes will break the clustering (for the data written by that write).
So, all unpartitioned tables automatically benefit from ingestion time clustering when new data is ingested. We recommend that customers not partition tables under 1 TB in size on date/timestamp columns and instead let ingestion time clustering take effect automatically.
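As a rough illustration of that check, here is a minimal sketch (the table name, column name, and filter value are hypothetical, and spark is the SparkSession predefined in a Databricks notebook). It compares the table's total size from DESCRIBE DETAIL with the scan metrics the query profile reports for a time-filtered query:

# Minimal sketch; table and column names are assumptions, not from this thread.
# Total files/bytes in the Delta table (DESCRIBE DETAIL is a Delta Lake command).
detail = spark.sql("DESCRIBE DETAIL my_db.events").select("numFiles", "sizeInBytes").first()
print(f"Full table: {detail['numFiles']} files, {detail['sizeInBytes']} bytes")

# Run a query with a time-based filter, then open its query profile /
# the Spark UI SQL tab and compare "files read" / "bytes read" to the totals above.
# If ingestion time clustering is effective, only a small fraction should be scanned.
spark.table("my_db.events").filter("event_time >= '2024-01-01'").count()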

Thanks & Regards,

Nandini


2 REPLIES


Oliver_Angelil
Valued Contributor II

Thanks @NandiniN, that was very helpful. 

I have 3 follow-up questions:

  1. If I already have a table (350 GB) that is partitioned by 3 columns (Year, Month, Day) and stored Hive-style with subdirectories Year=X/Month=Y/Day=Z, can I read it in, remove the partitions, and re-save it so that it benefits from Ingestion Time Clustering (the ingestion times are still saved in the per-file metadata)?
  2. Would Ingestion Time Clustering continue to work as I append data to my table daily, e.g. df.write.mode("append").format("delta").save("/mytable")?
  3. How can I decrease/increase partition sizes? Let's say I have been appending new data hourly, and each append produces a new Parquet file. After some years I may have tens of thousands of Parquet files, each being, say, 2 MB. How would I reduce the file count (increase file size)? (A rough sketch of these operations appears after this post.)

Thank you very much,
Oliver
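For context, a rough sketch of the operations described in the three questions above (the paths are hypothetical, the source table is assumed to already be Delta, and this is an illustration rather than an answer from the thread; OPTIMIZE is the Delta Lake command for compacting small files):

# Rough sketch only; paths are hypothetical and the source table is assumed to be Delta.

# Q1: read the partitioned table and rewrite it without partition columns in the layout
# (whether this preserves useful ingestion ordering is exactly the open question above).
df = spark.read.format("delta").load("/tables/events_partitioned")
df.write.format("delta").mode("overwrite").save("/tables/events_unpartitioned")

# Q2: daily append to the unpartitioned Delta table.
new_rows = spark.read.parquet("/landing/2024-01-01/")
new_rows.write.format("delta").mode("append").save("/tables/events_unpartitioned")

# Q3: compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE delta.`/tables/events_unpartitioned`")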
