<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: ingestion time clustering in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/38480#M5705</link>
    <description>&lt;P&gt;Thank you,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;.&lt;/P&gt;&lt;P&gt;Could you please shed a bit more light on the configuration? For instance, I am performing ingestion using DLT. Should I add extra parameters (like&amp;nbsp;&lt;STRONG&gt;&lt;SPAN class=""&gt;pipelines.autoOptimize.zOrderCols&lt;/SPAN&gt;&lt;/STRONG&gt;), or should it be done some other way?&lt;/P&gt;</description>
    <pubDate>Wed, 26 Jul 2023 10:52:51 GMT</pubDate>
    <dc:creator>mderela</dc:creator>
    <dc:date>2023-07-26T10:52:51Z</dc:date>
    <item>
      <title>ingestion time clustering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/38447#M5703</link>
      <description>&lt;P&gt;Hello, in reference to&amp;nbsp;&lt;A href="https://www.databricks.com/blog/2022/11/18/introducing-ingestion-time-clustering-dbr-112.html" target="_blank" rel="noopener"&gt;https://www.databricks.com/blog/2022/11/18/introducing-ingestion-time-clustering-dbr-112.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I have a basic question about how to use it. Let's assume that I have a few TB of non-partitioned data. If I would like to query data that has been ingested since yesterday, what should I do?&lt;/P&gt;&lt;PRE&gt;select * from mytable where &lt;STRONG&gt;WHAT_SHOULD_BE_HERE&lt;/STRONG&gt; &amp;gt;= current_timestamp() - INTERVAL 1 day&lt;/PRE&gt;&lt;P&gt;In other words, what do I need to filter on to make sure that only a small part of the files is scanned instead of the whole dataset? It is clear to me how to achieve that using partitions, but how is it done with ingestion time clustering?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jul 2023 05:55:49 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/38447#M5703</guid>
      <dc:creator>mderela</dc:creator>
      <dc:date>2023-07-26T05:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: ingestion time clustering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/38480#M5705</link>
      <description>&lt;P&gt;Thank you,&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;.&lt;/P&gt;&lt;P&gt;Could you please shed a bit more light on the configuration? For instance, I am performing ingestion using DLT. Should I add extra parameters (like&amp;nbsp;&lt;STRONG&gt;&lt;SPAN class=""&gt;pipelines.autoOptimize.zOrderCols&lt;/SPAN&gt;&lt;/STRONG&gt;), or should it be done some other way?&lt;/P&gt;</description>
      <pubDate>Wed, 26 Jul 2023 10:52:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/38480#M5705</guid>
      <dc:creator>mderela</dc:creator>
      <dc:date>2023-07-26T10:52:51Z</dc:date>
    </item>
    <item>
      <title>Re: ingestion time clustering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/41214#M5707</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;, referring to this: "&lt;SPAN&gt;Remember that this will only work if you have set up ingestion time clustering for your table". &lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;Can you please elaborate on how we can set up "ingestion time clustering" for existing non-partitioned tables?&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 23 Aug 2023 19:49:35 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/41214#M5707</guid>
      <dc:creator>JKR</dc:creator>
      <dc:date>2023-08-23T19:49:35Z</dc:date>
    </item>
    <item>
      <title>Re: ingestion time clustering</title>
      <link>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/41374#M5709</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;Thanks for sharing this, it is really helpful, but my question remains the same: how can we know whether&amp;nbsp;&lt;STRONG&gt;Ingestion Time Clustering &lt;/STRONG&gt;is enabled? According to the docs, it is enabled by default with DBR 11.2 and above.&lt;/P&gt;&lt;P&gt;- Are&amp;nbsp;&lt;SPAN&gt;Ingestion Time Clustering and Liquid Clustering similar?&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;- What about existing non-partitioned tables? Can I enable Liquid Clustering on those if I upgrade my interactive clusters to use &lt;SPAN&gt;Databricks 13.2 or above&lt;/SPAN&gt;?&lt;/P&gt;&lt;P&gt;My scenario: I have some non-partitioned Delta tables with around 200 to 300 GB of data each. The ETL requirement is to get the max timestamp, so I run &lt;STRONG&gt;select max(timestamp) from table&lt;/STRONG&gt; every 5 minutes on each of those tables, in separate jobs, and then use these max_timestamp values in their ETL pipelines.&lt;BR /&gt;&lt;BR /&gt;The max_timestamp query takes more than 2.5 minutes to fetch the value from those tables. Upon checking the Spark UI and DAG, I found that this query reads all the files behind the table without pruning any, which is why it takes that long just to fetch max(timestamp).&lt;BR /&gt;&lt;BR /&gt;What should I do to get max(timestamp) in less time (under 10 seconds) without partitioning the table, since Databricks recommends partitioning only tables larger than 1 TB?&lt;/P&gt;&lt;P&gt;Thanks&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Thu, 24 Aug 2023 15:57:37 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/inegstion-time-clustering/m-p/41374#M5709</guid>
      <dc:creator>JKR</dc:creator>
      <dc:date>2023-08-24T15:57:37Z</dc:date>
    </item>
  </channel>
</rss>

