<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: DLT: Autoloader Perf in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48535#M28323</link>
    <description>&lt;P&gt;It can be anywhere from 600 to 1.5k files. The DLT pipeline is set to Triggered for Pipeline mode and Continuous for Trigger type in Workflows.&lt;/P&gt;</description>
    <pubDate>Fri, 06 Oct 2023 05:00:08 GMT</pubDate>
    <dc:creator>Gilg</dc:creator>
    <dc:date>2023-10-06T05:00:08Z</dc:date>
    <item>
      <title>DLT: Autoloader Perf</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48525#M28318</link>
      <description>&lt;P&gt;Hi Team,&lt;/P&gt;&lt;P&gt;I am looking for advice on performance-tuning my bronze layer in DLT.&lt;/P&gt;&lt;P&gt;I have the following code, which is very simple yet effective.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt

@dlt.create_table(name="bronze_events",
                  comment="New raw data ingested from storage account landing zone.")
def bronze_events():
    # Incrementally ingest JSON files from the landing zone with Auto Loader.
    df = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .schema(schema)  # schema is assumed to be defined elsewhere (not shown)
        .load("abfss://data@&amp;lt;some storage account&amp;gt;.dfs.core.windows.net/0_Landing")
    )

    return df&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;That generates this DAG.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Gilg_0-1696561163925.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/4304iEB833D12B5BEA117/image-size/medium/is-moderation-mode/true?v=v2&amp;amp;px=400" role="button" title="Gilg_0-1696561163925.png" alt="Gilg_0-1696561163925.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;It used to execute quite fast, but as the days go by it is getting slower and slower: from 2 to 5 to 12 minutes. Silver and Gold all execute in less than a minute, so I am wondering what performance tuning I should do on the bronze layer.&lt;/P&gt;&lt;P&gt;Cheers,&lt;/P&gt;&lt;P&gt;G&lt;/P&gt;</description>
      <pubDate>Fri, 06 Oct 2023 03:07:03 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48525#M28318</guid>
      <dc:creator>Gilg</dc:creator>
      <dc:date>2023-10-06T03:07:03Z</dc:date>
    </item>
    <item>
      <title>Re: DLT: Autoloader Perf</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48528#M28321</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/11184"&gt;@Gilg&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Is it ingesting the same number of files as before?&lt;/P&gt;
&lt;P&gt;Also, you could try using Auto Loader in file notification mode. If there are too many files in the source directory, a significant amount of time is spent listing the directory. We can validate this by analyzing the logs.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Oct 2023 04:30:27 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48528#M28321</guid>
      <dc:creator>Tharun-Kumar</dc:creator>
      <dc:date>2023-10-06T04:30:27Z</dc:date>
    </item>
    <item>
      <title>Re: DLT: Autoloader Perf</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48535#M28323</link>
      <description>&lt;P&gt;It can be anywhere from 600 to 1.5k files. The DLT pipeline is set to Triggered for Pipeline mode and Continuous for Trigger type in Workflows.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Oct 2023 05:00:08 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48535#M28323</guid>
      <dc:creator>Gilg</dc:creator>
      <dc:date>2023-10-06T05:00:08Z</dc:date>
    </item>
    <item>
      <title>Re: DLT: Autoloader Perf</title>
      <link>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48934#M28426</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/11184"&gt;@Gilg&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You mentioned that the micro-batch time has recently been around 12 minutes. Do we also see jobs/stages taking 12 minutes in the Spark UI? If so, the processing of the files itself takes 12 minutes. If not, the 12 minutes are spent on listing the directory and maintaining the checkpoint.&lt;/P&gt;</description>
      <pubDate>Wed, 11 Oct 2023 09:49:25 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/dlt-autoloader-perf/m-p/48934#M28426</guid>
      <dc:creator>Tharun-Kumar</dc:creator>
      <dc:date>2023-10-11T09:49:25Z</dc:date>
    </item>
  </channel>
</rss>

