<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimizing Data Insertion Speed for JSON Files in DLT Pipeline in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimizing-data-insertion-speed-for-json-files-in-dlt-pipeline/m-p/70996#M34204</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;The DLT pipeline uses the Core product edition, as no features in the code require the Pro/Advanced edition. The pipeline runs without errors; the issue, as mentioned above, is that it can take up to 30 minutes before a row is inserted when the pipeline is set to continuous mode, which I find a bit strange.&amp;nbsp;&lt;/P&gt;</description>
    <pubDate>Wed, 29 May 2024 09:52:53 GMT</pubDate>
    <dc:creator>MiBjorn</dc:creator>
    <dc:date>2024-05-29T09:52:53Z</dc:date>
    <item>
      <title>Optimizing Data Insertion Speed for JSON Files in DLT Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-data-insertion-speed-for-json-files-in-dlt-pipeline/m-p/70986#M34199</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Background:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I'm working on a data pipeline to ingest JSON files as quickly as possible. Here are the details of my setup:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;File Size:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;1.5 - 2 kB each&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;File Volume:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Approximately 30,000 files per hour&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Pipeline:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;Using Databricks Delta Live Tables (DLT) in continuous pipeline mode&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Cluster Configuration:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;STRONG&gt;Type:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;Standard_E8ds_v4 (both worker and driver)&lt;/P&gt;&lt;P&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;STRONG&gt;Enhanced Autoscaling:&lt;/STRONG&gt;&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;Enabled with 1-5 workers&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Storage&lt;/STRONG&gt;: Files arrive continuously in an Azure Data Lake Storage (ADLS) Gen2 container&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Pre-processing:&lt;SPAN&gt;&amp;nbsp;&lt;/SPAN&gt;&lt;/STRONG&gt;JSON files are flattened before being inserted into the bronze layer&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Issue:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Despite the continuous arrival of files, records appear in the bronze layer in batches only every 20-30 minutes. This delay is causing issues with the timeliness of data processing.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Observations:&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;I reviewed the metrics for the DLT pipeline cluster and observed the following:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;CPU Utilization:&lt;/STRONG&gt; Does not exceed 10%.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Memory Utilization:&lt;/STRONG&gt; Approximately 40-45 GB out of 64 GB.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Network Traffic:&lt;/STRONG&gt; The "transmitted through network" graph shows spikes every 20-30 minutes, which match the times when data is inserted into the bronze layer.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Does anyone have an idea why this batching behavior is occurring despite the continuous mode setting? What should I check or adjust to diagnose and resolve this issue and ensure more immediate insertion of data?&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;If any other specifications or information are needed, please let me know.&lt;/SPAN&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 29 May 2024 07:16:36 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-data-insertion-speed-for-json-files-in-dlt-pipeline/m-p/70986#M34199</guid>
      <dc:creator>MiBjorn</dc:creator>
      <dc:date>2024-05-29T07:16:36Z</dc:date>
    </item>
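    <!-- Editor's note: a minimal sketch of the kind of bronze ingestion described in the post above. The storage path, table name, and the 10-second trigger interval are assumptions, not the poster's actual code. In a continuous DLT pipeline, micro-batch cadence is governed by the pipelines.trigger.interval setting, and Auto Loader's file-discovery mode (directory listing vs. file notifications) can dominate latency at roughly 30,000 small files per hour. -->

    ```python
    # Hypothetical DLT bronze table for continuous JSON ingestion via Auto Loader.
    # SOURCE_PATH and table name are placeholders (assumptions).
    import dlt

    SOURCE_PATH = "abfss://landing@STORAGEACCOUNT.dfs.core.windows.net/json/"

    @dlt.table(
        name="bronze_events",
        spark_conf={
            # Request more frequent micro-batches than the default cadence
            # in continuous mode (value chosen for illustration).
            "pipelines.trigger.interval": "10 seconds",
        },
    )
    def bronze_events():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "json")
            # File-notification mode can avoid repeated slow directory listings
            # over a high volume of small files.
            .option("cloudFiles.useNotifications", "true")
            .load(SOURCE_PATH)
        )
    ```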
    <item>
      <title>Re: Optimizing Data Insertion Speed for JSON Files in DLT Pipeline</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-data-insertion-speed-for-json-files-in-dlt-pipeline/m-p/70996#M34204</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;&amp;nbsp;The DLT pipeline uses the Core product edition, as no features in the code require the Pro/Advanced edition. The pipeline runs without errors; the issue, as mentioned above, is that it can take up to 30 minutes before a row is inserted when the pipeline is set to continuous mode, which I find a bit strange.&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 29 May 2024 09:52:53 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-data-insertion-speed-for-json-files-in-dlt-pipeline/m-p/70996#M34204</guid>
      <dc:creator>MiBjorn</dc:creator>
      <dc:date>2024-05-29T09:52:53Z</dc:date>
    </item>
  </channel>
</rss>

