<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Streaming delta table - Performance with incremental refresh in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/streaming-delta-table-performance-with-incremental-refresh/m-p/58908#M31293</link>
    <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/99091"&gt;@Fnazar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Streaming writes tend to produce many small files, which slows down reads. Use Delta Lake's OPTIMIZE command to compact them into larger files, and ZORDER BY to colocate related information in the same set of files. Z-ordering is particularly useful for columns that often appear together in query filters.&lt;/P&gt;
&lt;P&gt;If you partition the table, select a partition column that results in evenly distributed data. Common choices are a date column (for time-based data) or a well-balanced categorical column.&lt;/P&gt;
&lt;P&gt;When writing a DataFrame to a Delta table, you can specify the partitioning with the partitionBy method (the SQL equivalent is the PARTITIONED BY clause of CREATE TABLE). For instance, to partition by a date column: df.write.format("delta").partitionBy("date_column").save("/mnt/delta/my_table")&lt;/P&gt;
&lt;P&gt;This command creates one partition in the Delta table for each unique value in date_column.&lt;/P&gt;
&lt;P&gt;If you're ingesting streaming data into Delta Lake, consider using Auto Loader for efficient and incremental processing of new data.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.delta.io/latest/best-practices.html" target="_blank"&gt;https://docs.delta.io/latest/best-practices.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-optimize.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/delta-optimize.html&lt;/A&gt;&lt;/P&gt;
</description>
    <pubDate>Thu, 01 Feb 2024 01:24:09 GMT</pubDate>
    <dc:creator>Priyanka_Biswas</dc:creator>
    <dc:date>2024-02-01T01:24:09Z</dc:date>
    <item>
      <title>Streaming delta table - Performance with incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-delta-table-performance-with-incremental-refresh/m-p/58807#M31267</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;We are hitting performance issues with streaming live Delta tables, specifically when evaluating large tables of more than 10 million rows.&amp;nbsp;&lt;BR /&gt;What workarounds are there for loading such large tables with streaming live tables?&amp;nbsp;&lt;BR /&gt;Also, if we can use PARTITION BY, could you please help me with the syntax?&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;</description>
      <pubDate>Wed, 31 Jan 2024 11:15:45 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-delta-table-performance-with-incremental-refresh/m-p/58807#M31267</guid>
      <dc:creator>Fnazar</dc:creator>
      <dc:date>2024-01-31T11:15:45Z</dc:date>
    </item>
    <item>
      <title>Re: Streaming delta table - Performance with incremental refresh</title>
      <link>https://community.databricks.com/t5/data-engineering/streaming-delta-table-performance-with-incremental-refresh/m-p/58908#M31293</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/99091"&gt;@Fnazar&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Streaming writes tend to produce many small files, which slows down reads. Use Delta Lake's OPTIMIZE command to compact them into larger files, and ZORDER BY to colocate related information in the same set of files. Z-ordering is particularly useful for columns that often appear together in query filters.&lt;/P&gt;
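&lt;P&gt;As a rough sketch, the compaction statement can be assembled and then passed to spark.sql. The helper name, the table name and the column names below are placeholders chosen for illustration, not taken from your pipeline.&lt;/P&gt;

```python
def build_optimize_statement(table_name, zorder_columns=()):
    """Return an OPTIMIZE statement, optionally with a ZORDER BY clause."""
    statement = f"OPTIMIZE {table_name}"
    if zorder_columns:
        # ZORDER BY takes a comma-separated column list in parentheses.
        statement += " ZORDER BY (" + ", ".join(zorder_columns) + ")"
    return statement


# Hypothetical table and column names, for illustration only.
stmt = build_optimize_statement("events", ["customer_id", "device_id"])
print(stmt)
# On a cluster with an active SparkSession named `spark`:
# spark.sql(stmt)
```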
&lt;P&gt;If you partition the table, select a partition column that results in evenly distributed data. Common choices are a date column (for time-based data) or a well-balanced categorical column.&lt;/P&gt;
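&lt;P&gt;One way to sanity-check a candidate column is to look at how many distinct values it has and how large the biggest group is. This is only a plain-Python sketch on a made-up sample; the function name is hypothetical, and on a real table you would run the equivalent groupBy and count in Spark.&lt;/P&gt;

```python
from collections import Counter


def partition_skew(values):
    """Return (distinct_count, largest_share) for a candidate partition column.

    Assumes `values` is a non-empty sample of the column.
    """
    counts = Counter(values)
    total = sum(counts.values())
    largest_share = max(counts.values()) / total
    return len(counts), largest_share


# Made-up sample of a date column: one day holds half of the rows.
sample = ["2024-01-01"] * 5 + ["2024-01-02"] * 3 + ["2024-01-03"] * 2
distinct, share = partition_skew(sample)
print(distinct, share)
```

&lt;P&gt;A largest share close to 1 divided by the number of distinct values suggests a well-balanced column; a much larger share points to skew.&lt;/P&gt;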
&lt;P&gt;When writing a DataFrame to a Delta table, you can specify the partitioning with the partitionBy method (the SQL equivalent is the PARTITIONED BY clause of CREATE TABLE). For instance, to partition by a date column: df.write.format("delta").partitionBy("date_column").save("/mnt/delta/my_table")&lt;/P&gt;
&lt;P&gt;This command creates one partition in the Delta table for each unique value in date_column.&lt;/P&gt;
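&lt;P&gt;Wrapped in a small helper, the same write might look like the sketch below. The function name and the default append mode are my own assumptions; df is expected to be a Spark DataFrame.&lt;/P&gt;

```python
def write_partitioned_delta(df, path, partition_column, mode="append"):
    """Write df as a Delta table at path, partitioned by partition_column."""
    (
        df.write.format("delta")
        .mode(mode)  # "append" is an assumed default, not from the original reply
        .partitionBy(partition_column)
        .save(path)
    )


# Usage on a cluster, reusing the path and column from the example above:
# write_partitioned_delta(df, "/mnt/delta/my_table", "date_column")
```

&lt;P&gt;For a streaming DataFrame the analogous chain starts from df.writeStream, also accepts partitionBy, needs a checkpointLocation option, and ends with start instead of save.&lt;/P&gt;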
&lt;P&gt;If you're ingesting streaming data into Delta Lake, consider using Auto Loader for efficient and incremental processing of new data.&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.delta.io/latest/best-practices.html" target="_blank"&gt;https://docs.delta.io/latest/best-practices.html&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&lt;A href="https://docs.databricks.com/en/sql/language-manual/delta-optimize.html" target="_blank"&gt;https://docs.databricks.com/en/sql/language-manual/delta-optimize.html&lt;/A&gt;&lt;/P&gt;
</description>
      <pubDate>Thu, 01 Feb 2024 01:24:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/streaming-delta-table-performance-with-incremental-refresh/m-p/58908#M31293</guid>
      <dc:creator>Priyanka_Biswas</dc:creator>
      <dc:date>2024-02-01T01:24:09Z</dc:date>
    </item>
  </channel>
</rss>

