
Confirmation that Ingestion Time Clustering is applied

Oliver_Angelil
Valued Contributor II

The article on Ingestion Time Clustering mentions that "Ingestion Time Clustering is enabled by default on Databricks Runtime 11.2"; however, how can I confirm that it is active for my table?

For example, is there a:

  • A True/False "Ingestion Time Clustered" flag to confirm?
  • A new column that is created?
  • A particular way the partitions are structured?

Thanks,
Oliver

 

1 ACCEPTED SOLUTION


NandiniN
Databricks Employee

Hello @Oliver_Angelil ,
Ingestion time clustering doesn't use any field; it simply uses the time your data arrives. It relies on implicit clustering by ingestion time and doesn't store that time anywhere other than in the per-file metadata, and it does not disturb the natural order of the records.
To verify it, you'd have to look at the query profile (Spark UI / query profile), see how much data is scanned for the table, and compare that to the full table size, for queries where you would expect it to help, i.e., queries with a time-based filter.
When we say it is on by default, we mean we always use this config on unpartitioned tables. So the flag would always be "true" on DBR 11.2+, but that on its own would be deceiving, because we never know whether it will help every workload. By that I mean: if you have ZORDER, it would not. Ingestion time clustering works with auto compaction, but optimized writes will break the clustering (for the data written by that write).
So, all unpartitioned tables automatically benefit from ingestion time clustering when new data is ingested. We recommend that customers not partition tables under 1 TB in size on date/timestamp columns and instead let ingestion time clustering take effect automatically.
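As a rough illustration of that check, here is a minimal sketch (the table name, column name, and filter value are hypothetical, and spark is the SparkSession predefined in a Databricks notebook). It compares the table's total size from DESCRIBE DETAIL with the scan metrics the query profile reports for a time-filtered query:

# Minimal sketch; table and column names are assumptions, not from this thread.
# Total files/bytes in the Delta table (DESCRIBE DETAIL is a Delta Lake command).
detail = spark.sql("DESCRIBE DETAIL my_db.events").select("numFiles", "sizeInBytes").first()
print(f"Full table: {detail['numFiles']} files, {detail['sizeInBytes']} bytes")

# Run a query with a time-based filter, then open its query profile /
# the Spark UI SQL tab and compare "files read" / "bytes read" to the totals above.
# If ingestion time clustering is effective, only a small fraction should be scanned.
spark.table("my_db.events").filter("event_time >= '2024-01-01'").count()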

Thanks & Regards,

Nandini


2 REPLIES


Oliver_Angelil
Valued Contributor II

Thanks @NandiniN, that was very helpful. 

I have 3 follow-up questions:

  1. If I already have a table (350 GB) that is partitioned by 3 columns (Year, Month, Day) and stored Hive-style with subdirectories Year=X/Month=Y/Day=Z, can I read it in, remove the partitions, and re-save it so that it benefits from Ingestion Time Clustering (the ingestion times are still saved in the per-file metadata)?
  2. Would Ingestion Time Clustering continue to work as I append data to my table daily, e.g. df.write.mode("append").format("delta").save("/mytable")?
  3. How can I decrease/increase partition sizes? Let's say I have been appending new data hourly, and each append produces a new Parquet file. After some years I may have tens of thousands of Parquet files, each being, say, 2 MB. How would I reduce the file count (increase file size)? (A rough sketch of these operations appears after this post.)

Thank you very much,
Oliver
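For context, a rough sketch of the operations described in the three questions above (the paths are hypothetical, the source table is assumed to already be Delta, and this is an illustration rather than an answer from the thread; OPTIMIZE is the Delta Lake command for compacting small files):

# Rough sketch only; paths are hypothetical and the source table is assumed to be Delta.

# Q1: read the partitioned table and rewrite it without partition columns in the layout
# (whether this preserves useful ingestion ordering is exactly the open question above).
df = spark.read.format("delta").load("/tables/events_partitioned")
df.write.format("delta").mode("overwrite").save("/tables/events_unpartitioned")

# Q2: daily append to the unpartitioned Delta table.
new_rows = spark.read.parquet("/landing/2024-01-01/")
new_rows.write.format("delta").mode("append").save("/tables/events_unpartitioned")

# Q3: compact many small files into fewer, larger ones.
spark.sql("OPTIMIZE delta.`/tables/events_unpartitioned`")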
