
Confirmation that Ingestion Time Clustering is applied

Oliver_Angelil
Valued Contributor II

The article on Ingestion Time Clustering mentions that "Ingestion Time Clustering is enabled by default on Databricks Runtime 11.2". However, how can I confirm that it is active for my table?

For example, is there:

  • A True/False "Ingestion Time Clustered" flag to confirm?
  • A new column that is created?
  • A particular way the partitions are structured?
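
To make it concrete, the kind of check I was hoping exists would look something like this (the table name is just a placeholder):

# Hypothetical check -- I'm looking for a property or flag along these lines,
# but neither command seems to expose an explicit "ingestion time clustered" field.
spark.sql("DESCRIBE DETAIL main.default.my_table").show(truncate=False)
spark.sql("SHOW TBLPROPERTIES main.default.my_table").show(truncate=False)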

Thanks,
Oliver

 

1 ACCEPTED SOLUTION

NandiniN
Valued Contributor II

Hello @Oliver_Angelil ,
Ingestion time clustering doesn't use any field; it simply uses the time your data arrives. The clustering is implicit, based on ingestion time, and that time isn't stored anywhere other than the per-file metadata. It does not disturb the natural order of the records.
To confirm it, you'd have to look at the query profile (or the Spark UI), see how much data is scanned for the table, and compare that to the full table size. Do this for queries where you would expect the clustering to help, i.e., queries with a time-based filter.
When we say it is enabled by default, we mean this config is always applied (on unpartitioned tables). So the flag would always be "true" on DBR 11.2+, but that alone would be deceiving, because it doesn't tell you whether it actually helps your workloads. By that I mean: if you have ZORDER, it would not. Ingestion time clustering survives auto compaction, but optimized writes will break the clustering (for the data written by that write).
So, all unpartitioned tables automatically benefit from ingestion time clustering as new data is ingested. We recommend not partitioning tables under 1 TB in size on date/timestamp columns, and letting ingestion time clustering take effect automatically.
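
As a rough sketch of that comparison (the table name, timestamp column, and dates below are placeholders, not anything specific to your table):

# 1. Full table size and file count from the Delta metadata.
detail = spark.sql("DESCRIBE DETAIL my_table").collect()[0]
print(f"Table size: {detail['sizeInBytes'] / 1e9:.2f} GB across {detail['numFiles']} files")

# 2. Run a query with a time-based filter, then open its query profile /
#    Spark UI and compare the bytes and files read against the totals above.
#    If ingestion time clustering is helping, only a small fraction of the
#    files should be scanned.
spark.sql("""
  SELECT count(*)
  FROM my_table
  WHERE event_time >= '2024-01-01' AND event_time < '2024-01-02'
""").show()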

Thanks & Regards,

Nandini

2 REPLIES

Oliver_Angelil
Valued Contributor II

Thanks @NandiniN, that was very helpful. 

I have 3 follow-up questions:

  1. If I already have a table (350 GB) that has been partitioned by 3 columns: Year, Month, Day, and stored Hive-style with subdirectories Year=X/Month=Y/Day=Z, can I read it in, remove the partitions, and re-save it (roughly as sketched below), so that it can benefit from Ingestion Time Clustering (the ingestion times have still been saved in the per-file metadata)?
  2. Would Ingestion Time Clustering continue to work as I append data to my table daily, e.g. df.write.mode("append").format("delta").save("/mytable")?
  3. How can I decrease/increase partition sizes? Let's say I have been appending new data hourly, and each append creates a new Parquet file. After some years I may have tens of thousands of Parquet files, each of, say, 2 MB. How would I reduce the file count (i.e., increase the file size)?
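
For reference, the rewrite in (1) and the append in (2) would look roughly like this (the paths are placeholders, and I'm assuming the existing table is stored as Delta):

# (1) Read the Year/Month/Day-partitioned table and rewrite it without any
#     partitioning (placeholder paths).
df = spark.read.format("delta").load("/mnt/tables/my_partitioned_table")
(df.write
   .format("delta")
   .mode("overwrite")
   .save("/mnt/tables/my_unpartitioned_table"))  # no partitionBy() on the rewrite

# (2) The daily append against the unpartitioned table.
daily_df = spark.read.parquet("/mnt/landing/latest_day")
daily_df.write.mode("append").format("delta").save("/mnt/tables/my_unpartitioned_table")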

Thank you very much,
Oliver
