Databricks Community

Oliver_Angelil · ‎07-27-2023

The article on Ingestion Time Clustering mentions that "Ingestion Time Clustering is enabled by default on Databricks Runtime 11.2". However, how can I confirm that it is active for my table?

For example, is there a:

True/False "Ingestion Time Clustered" flag to confirm?
A new column that is created?
A way the partitions are structured?

Thanks,
Oliver

NandiniN · ‎07-27-2023

Hello @Oliver_Angelil ,
Ingestion time clustering doesn't use any field. It just uses the time that your data arrives! The clustering is implicit, based on ingestion time: it doesn't store this time anywhere other than in the per-file metadata, and it does not disturb the natural order of the records.
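To illustrate why keeping files in arrival order helps time-filtered queries, here is a minimal pure-Python sketch of file skipping on per-file min/max values. It is not Databricks code: FileStats, layout, and files_to_scan are hypothetical names, and it assumes each file carries the minimum and maximum timestamp of the rows it holds.

```python
from dataclasses import dataclass
import random


@dataclass(frozen=True)
class FileStats:
    """Hypothetical per-file metadata: min and max timestamp in the file."""
    min_ts: int
    max_ts: int


def layout(rows, rows_per_file):
    """Group rows into files in the given order and record min/max per file."""
    chunks = [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]
    return [FileStats(min(c), max(c)) for c in chunks]


def files_to_scan(files, lo, hi):
    """Return the files whose [min_ts, max_ts] range overlaps the filter [lo, hi]."""
    return [f for f in files if f.max_ts >= lo and f.min_ts <= hi]


# 1,000 rows with timestamps 0..999, written 10 rows per file.
timestamps = list(range(1000))

# Files written in arrival order: each file covers a narrow time range.
clustered = layout(timestamps, rows_per_file=10)

# Same rows after a rewrite that ignores arrival order (simulated with a shuffle).
shuffled_rows = timestamps[:]
random.Random(0).shuffle(shuffled_rows)
scattered = layout(shuffled_rows, rows_per_file=10)

# A time-based filter covering 5% of the rows.
lo, hi = 500, 549
print(len(files_to_scan(clustered, lo, hi)), "of", len(clustered), "files (arrival order)")
print(len(files_to_scan(scattered, lo, hi)), "of", len(scattered), "files (order lost)")
```

In the arrival-order layout the filter touches 5 of the 100 files; once the order is lost, far more files overlap the filter range and have to be read.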
To check whether it is working, you'd have to look at the query profile in the Spark UI and see how much data is scanned for the table, then compare that to the full table size. Do this for queries where you would expect it to help, i.e., queries with a time-based filter.
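A minimal sketch of that comparison, assuming a Delta table at the hypothetical path /mytable and an active spark session. DESCRIBE DETAIL reports sizeInBytes for the table; the bytes scanned are not retrieved by this snippet and have to be read manually from the query profile. scanned_fraction and table_size_bytes are hypothetical helper names.

```python
def table_size_bytes(spark, path):
    """Total size of the current table snapshot, taken from DESCRIBE DETAIL."""
    detail = spark.sql(f"DESCRIBE DETAIL delta.`{path}`").first()
    return detail["sizeInBytes"]


def scanned_fraction(bytes_scanned, table_bytes):
    """Share of the table a query had to read (1.0 means a full scan)."""
    if table_bytes <= 0:
        raise ValueError("table_bytes must be positive")
    return bytes_scanned / table_bytes


# Usage on a cluster (not run here; bytes_read comes from the query profile):
#   total = table_size_bytes(spark, "/mytable")
#   print(f"{scanned_fraction(bytes_read, total):.1%}")

# Example with made-up numbers: 7 GB read out of a 350 GB table.
print(f"{scanned_fraction(7 * 1024**3, 350 * 1024**3):.1%}")
```

A small fraction for a query with a narrow time filter suggests that files are being skipped; a value near 100% suggests the clustering is not helping that query.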
When it is said to be true by default, it means we always use this config (on unpartitioned tables). So such a flag would always be "true" on DBR 11.2+, but it would be misleading, because we never know whether it will work for all workloads. By that I mean: if you have ZORDER, it would not. Ingestion time clustering works with auto compaction. Optimized writes will break the clustering (for the data written by that write).
So, all unpartitioned tables automatically benefit from ingestion time clustering when new data is ingested. We recommend that customers not partition tables under 1 TB in size on date/timestamp columns and instead let ingestion time clustering take effect automatically.

Thanks & Regards,

Nandini

Oliver_Angelil · ‎07-27-2023

Thanks @NandiniN, that was very helpful.

I have 3 follow-up questions:

If I already have a table (350 GB) that is partitioned by three columns (Year, Month, Day) and stored Hive-style with subdirectories Year=X/Month=Y/Day=Z, can I read it in, remove the partitions, and re-save it so that it can benefit from Ingestion Time Clustering (assuming the ingestion times have still been saved in the per-file metadata)?
Would Ingestion Time Clustering continue to work as I append data to my table daily with df.write.mode("append").format("delta").save("/mytable")?
How can I decrease/increase partition sizes? Let's say I have been appending new data hourly, and each append creates a new Parquet file. After some years I may have tens of thousands of Parquet files, each about 2 MB. How would I reduce the file count (increase the file size)?

Thank you very much,
Oliver

Confirmation that Ingestion Time Clustering is applied
