<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Using Delta Live Tables Structured Streaming for small batches? in Get Started Discussions</title>
    <link>https://community.databricks.com/t5/get-started-discussions/using-delta-live-tables-structured-streaming-for-small-batches/m-p/36848#M5407</link>
    <description>&lt;P&gt;Hi Databricks community,&lt;/P&gt;&lt;P&gt;I have a blob storage folder that will receive file drops, with 3 files in each distinct drop: e.g., files A1, B1, C1 are one drop and A2, B2, C2 are the next. The DLT pipeline I've set up has a lot of joins and aggregations and currently uses dlt.read (not read_stream). The joins only need to consider data from files within the same drop (i.e., A1 data would never need to be merged with B2 data).&lt;/P&gt;&lt;P&gt;I do initially read the new files into bronze layer tables with a stream read, but I'm unsure what the best method is from there. As I understand it, dlt.read() will read all data in the 3 bronze tables (which will contain A1...AN, B1...BN, etc.), which seems inefficient at scale because it would re-read already processed data. So I was thinking that a stream read would be the method to use, but I'd have to choose a large watermark interval (say, a day), since we could get multiple datasets dropped at a time and they are fairly large.&lt;/P&gt;&lt;P&gt;Is a streaming read with a watermark the method to use here for these incremental file drops? Or is there some other design I should be considering?&lt;/P&gt;&lt;P&gt;Thanks!!&lt;/P&gt;</description>
    <pubDate>Mon, 03 Jul 2023 14:35:31 GMT</pubDate>
    <dc:creator>CVogel</dc:creator>
    <dc:date>2023-07-03T14:35:31Z</dc:date>
    <item>
      <title>Using Delta Live Tables Structured Streaming for small batches?</title>
      <link>https://community.databricks.com/t5/get-started-discussions/using-delta-live-tables-structured-streaming-for-small-batches/m-p/36848#M5407</link>
      <description>&lt;P&gt;Hi Databricks community,&lt;/P&gt;&lt;P&gt;I have a blob storage folder that will receive file drops, with 3 files in each distinct drop: e.g., files A1, B1, C1 are one drop and A2, B2, C2 are the next. The DLT pipeline I've set up has a lot of joins and aggregations and currently uses dlt.read (not read_stream). The joins only need to consider data from files within the same drop (i.e., A1 data would never need to be merged with B2 data).&lt;/P&gt;&lt;P&gt;I do initially read the new files into bronze layer tables with a stream read, but I'm unsure what the best method is from there. As I understand it, dlt.read() will read all data in the 3 bronze tables (which will contain A1...AN, B1...BN, etc.), which seems inefficient at scale because it would re-read already processed data. So I was thinking that a stream read would be the method to use, but I'd have to choose a large watermark interval (say, a day), since we could get multiple datasets dropped at a time and they are fairly large.&lt;/P&gt;&lt;P&gt;Is a streaming read with a watermark the method to use here for these incremental file drops? Or is there some other design I should be considering?&lt;/P&gt;&lt;P&gt;Thanks!!&lt;/P&gt;</description>
      <pubDate>Mon, 03 Jul 2023 14:35:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/get-started-discussions/using-delta-live-tables-structured-streaming-for-small-batches/m-p/36848#M5407</guid>
      <dc:creator>CVogel</dc:creator>
      <dc:date>2023-07-03T14:35:31Z</dc:date>
    </item>
  </channel>
</rss>

