โ07-27-2023 04:12 AM
The article on Ingestion Time Clustering mentions that "Ingestion Time Clustering is enabled by default on Databricks Runtime 11.2", however how can I confirm is it active for my table?
For example, is there a:
Thanks,
Oliver
โ07-27-2023 07:21 AM - edited โ07-27-2023 07:22 AM
Hello @Oliver_Angelil ,
Ingestion time clustering doesn't use any field. It just uses the time that your data arrives! Ingestion time clustering uses the implicit clustering based on ingestion time, it doesn't store this time anywhere other than in the per-file metadata. It does not disturbing the natural order of the records.
To understand you'd have to look at the query profiles in the Spark UI/query profile and see how much data is scanned for the table, and compare that to the full table size. For queries where you would expect it to work, i.e., queries with a time based filter.
When it is said it is by default true, we always use this config (on unpartitioned tables). So the metric would always be "true" on DBR 11.2+, but the metric would be deceiving, because we never know if it will work for all the workloads. By that I mean - if you have ZORDER, it would not. Ingestion time clustering works for auto compaction. Optimized writes(for the data written by that write) will break the clustering.
So, all unpartitioned tables will automatically benefit from ingestion time clustering when new data is ingested. We recommend customers to not partition tables under 1TB in size on date/timestamp columns and let ingestion time clustering automatically take effect.
Thanks & Regards,
Nandini
โ07-27-2023 07:21 AM - edited โ07-27-2023 07:22 AM
Hello @Oliver_Angelil ,
Ingestion time clustering doesn't use any field. It just uses the time that your data arrives! Ingestion time clustering uses the implicit clustering based on ingestion time, it doesn't store this time anywhere other than in the per-file metadata. It does not disturbing the natural order of the records.
To understand you'd have to look at the query profiles in the Spark UI/query profile and see how much data is scanned for the table, and compare that to the full table size. For queries where you would expect it to work, i.e., queries with a time based filter.
When it is said it is by default true, we always use this config (on unpartitioned tables). So the metric would always be "true" on DBR 11.2+, but the metric would be deceiving, because we never know if it will work for all the workloads. By that I mean - if you have ZORDER, it would not. Ingestion time clustering works for auto compaction. Optimized writes(for the data written by that write) will break the clustering.
So, all unpartitioned tables will automatically benefit from ingestion time clustering when new data is ingested. We recommend customers to not partition tables under 1TB in size on date/timestamp columns and let ingestion time clustering automatically take effect.
Thanks & Regards,
Nandini
โ07-27-2023 07:52 AM
Thanks @NandiniN, that was very helpful.
I have 3 follow-up questions:
Thank you very much,
Oliver
Passionate about hosting events and connecting people? Help us grow a vibrant local communityโsign up today to get started!
Sign Up Now