<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65711#M32902</link>
    <description>&lt;P&gt;In addition to Kaniz's suggestion: I noticed that you are using a Standard_DS3_v2 worker, which might be too small for this job. If you can afford to move to a larger worker, you should consider that.&lt;/P&gt;</description>
    <pubDate>Sun, 07 Apr 2024 01:46:57 GMT</pubDate>
    <dc:creator>standup1</dc:creator>
    <dc:date>2024-04-07T01:46:57Z</dc:date>
    <item>
      <title>Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65336#M32786</link>
      <description>&lt;P&gt;I'm currently facing challenges with &lt;STRONG&gt;optimizing the performance of a Delta Live Table pipeline in Azure Databricks&lt;/STRONG&gt;. The task involves ingesting &lt;STRONG&gt;over 10 TB of raw JSON log files&lt;/STRONG&gt; from an Azure Data Lake Storage account into a bronze Delta Live Table layer. Notably, &lt;STRONG&gt;the number of JSON files exceeds 500,000&lt;/STRONG&gt;. The table structure is quite wide, featuring &lt;STRONG&gt;more than 4000 columns (out of over 10,000 in the source files)&lt;/STRONG&gt; and totaling &lt;STRONG&gt;over 12 billion rows&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;In this process, I'm not performing any complex transformations, just appending a few columns for partitioning and log tracking. All data is ingested as strings. Auto Loader is configured with file notification mode enabled.&lt;/P&gt;&lt;P&gt;However, the initial data load is projected to take an &lt;STRONG&gt;excessive amount of time&lt;/STRONG&gt;, estimated at over &lt;STRONG&gt;20 days for all files&lt;/STRONG&gt;, which is far beyond acceptable limits. Below is a snippet of my current SQL notebook setup, and I'm open to transitioning to PySpark if it offers a better solution to this bottleneck.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Here is the code and pipeline settings that I'm using:&lt;/STRONG&gt;&lt;/P&gt;&lt;LI-CODE lang="sql"&gt;CREATE OR REFRESH STREAMING LIVE TABLE `periodic_raw_poc`
PARTITIONED BY (device_id)
COMMENT "Ingest raw JSON data into a Delta Live Table with a predefined schema."
TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
)
AS SELECT 
-- Derive the tracking and partition columns from the source file path
regexp_extract(_metadata.file_path, '(\\w+_\\w+_\\w+_\\w+)', 1) AS log_name,
regexp_extract(_metadata.file_path, '(\\w+)_(\\w+)_(\\w+)_(\\w+)', 1) AS device_id,
*
FROM cloud_files(
  "abfss://&amp;lt;container-name&amp;gt;@&amp;lt;storage-account-name&amp;gt;.dfs.core.windows.net/RUNFILES/*/*.log.periodic.json",
  "json",
  map(
    "cloudFiles.subscriptionId", "****",
    "cloudFiles.tenantId", "****",
    "cloudFiles.clientId", "****",
    "cloudFiles.clientSecret", "****",
    "cloudFiles.resourceGroup", "****",
    "cloudFiles.useNotifications", "true",
    "cloudFiles.fetchParallelism", "32",
    "schema", "`_timestamp` STRING,... (rest of the 4000 columns as STRING type)"
  )
)&lt;/LI-CODE&gt;&lt;LI-CODE lang="javascript"&gt;{
    "id": "****",
    "pipeline_type": "WORKSPACE",
    "clusters": [
        {
            "label": "default",
            "node_type_id": "Standard_DS3_v2",
            "driver_node_type_id": "Standard_DS4_v2",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 6,
                "mode": "ENHANCED"
            }
        }
    ],
    "development": true,
    "continuous": false,
    "channel": "PREVIEW",
    "photon": true,
    "libraries": [
        {
            "notebook": {
                "path": "SQL notebook path"
            }
        }
    ],
    "name": "****",
    "edition": "ADVANCED",
    "catalog": "****",
    "target": "****",
    "data_sampling": false
}&lt;/LI-CODE&gt;&lt;P&gt;Despite these settings, the process is much slower than anticipated. I'm looking for insights or optimization strategies from those who have tackled similar challenges with &lt;STRONG&gt;Delta Live Tables&lt;/STRONG&gt;, especially concerning large-scale data ingestion.&lt;/P&gt;
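&lt;P&gt;For reference, here is a rough sketch of what the same ingestion could look like in a DLT Python notebook; the path, the service-principal options, and the one-column schema string are placeholders standing in for the real values:&lt;/P&gt;&lt;LI-CODE lang="python"&gt;import dlt
from pyspark.sql.functions import col, regexp_extract

# Placeholder: the real schema declares ~4000 STRING columns.
RAW_SCHEMA = "`_timestamp` STRING"

@dlt.table(
    name="periodic_raw_poc",
    comment="Ingest raw JSON data into a bronze Delta Live Table.",
    partition_cols=["device_id"],
    table_properties={
        "delta.minReaderVersion": "2",
        "delta.minWriterVersion": "5",
        "delta.columnMapping.mode": "name",
    },
)
def periodic_raw_poc():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # File notification mode; the service-principal options
        # (cloudFiles.subscriptionId, cloudFiles.tenantId, etc.) are
        # set with the same masked values as in the SQL version.
        .option("cloudFiles.useNotifications", "true")
        .option("cloudFiles.fetchParallelism", "32")
        .schema(RAW_SCHEMA)
        .load("abfss://&amp;lt;container-name&amp;gt;@&amp;lt;storage-account-name&amp;gt;.dfs.core.windows.net/RUNFILES/*/*.log.periodic.json")
        # Same path-derived columns as the SQL version
        .withColumn("log_name", regexp_extract(col("_metadata.file_path"), r"(\w+_\w+_\w+_\w+)", 1))
        .withColumn("device_id", regexp_extract(col("_metadata.file_path"), r"(\w+)_(\w+)_(\w+)_(\w+)", 1))
    )&lt;/LI-CODE&gt;</description>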
      <pubDate>Tue, 02 Apr 2024 21:48:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65336#M32786</guid>
      <dc:creator>brian_zavareh</dc:creator>
      <dc:date>2024-04-02T21:48:58Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65708#M32899</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9"&gt;@Retired_mod&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;Thank you for the insightful suggestions on optimizing our Delta Live Table pipeline. I'm excited to apply your recommendations, especially around partitioning, clustering, and Auto Loader configurations. I'll implement these changes and reach out if I have any more questions.&lt;/P&gt;&lt;P&gt;Cheers&lt;/P&gt;</description>
      <pubDate>Sat, 06 Apr 2024 20:25:44 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65708#M32899</guid>
      <dc:creator>brian_zavareh</dc:creator>
      <dc:date>2024-04-06T20:25:44Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65711#M32902</link>
      <description>&lt;P&gt;In addition to Kaniz's suggestion: I noticed that you are using a Standard_DS3_v2 worker, which might be too small for this job. If you can afford to move to a larger worker, you should consider that.&lt;/P&gt;
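&lt;P&gt;For example, the clusters block of the pipeline settings could point at a bigger VM size; Standard_DS5_v2 and the autoscale bounds below are only illustrative, not a sizing recommendation:&lt;/P&gt;&lt;LI-CODE lang="json"&gt;{
    "clusters": [
        {
            "label": "default",
            "node_type_id": "Standard_DS5_v2",
            "driver_node_type_id": "Standard_DS5_v2",
            "autoscale": {
                "min_workers": 2,
                "max_workers": 10,
                "mode": "ENHANCED"
            }
        }
    ]
}&lt;/LI-CODE&gt;</description>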
      <pubDate>Sun, 07 Apr 2024 01:46:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65711#M32902</guid>
      <dc:creator>standup1</dc:creator>
      <dc:date>2024-04-07T01:46:57Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65915#M32954</link>
      <description>&lt;P&gt;Thanks for the heads-up &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/102950"&gt;@standup1&lt;/a&gt;. I agree; I was using it for the POC, and I will select bigger nodes for the main job. Do you know of any good practices for choosing the number of workers and the node sizes for both the workers and the driver?&lt;/P&gt;</description>
      <pubDate>Tue, 09 Apr 2024 16:00:51 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65915#M32954</guid>
      <dc:creator>brian_zavareh</dc:creator>
      <dc:date>2024-04-09T16:00:51Z</dc:date>
    </item>
    <item>
      <title>Re: Optimizing Delta Live Table Ingestion Performance for Large JSON Datasets</title>
      <link>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65917#M32956</link>
      <description>&lt;P&gt;Hey &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/41741"&gt;@brian_zavareh&lt;/a&gt;, see this document; I hope it helps.&lt;/P&gt;&lt;P&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/databricks/compute/cluster-config-best-practices" target="_blank"&gt;https://learn.microsoft.com/en-us/azure/databricks/compute/cluster-config-best-practices&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Just keep in mind that there's some extra cost on the Azure VM side; check your Azure Cost Analysis for details. Use tags when you create your pipeline so it's easy to drill down to that specific pipeline's cost.&lt;/P&gt;
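&lt;P&gt;Tags can go under custom_tags in the pipeline's cluster settings; the keys and values below are just hypothetical examples:&lt;/P&gt;&lt;LI-CODE lang="json"&gt;{
    "clusters": [
        {
            "label": "default",
            "custom_tags": {
                "project": "periodic-raw-ingestion",
                "cost_center": "data-engineering"
            }
        }
    ]
}&lt;/LI-CODE&gt;</description>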
      <pubDate>Tue, 09 Apr 2024 16:14:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/optimizing-delta-live-table-ingestion-performance-for-large-json/m-p/65917#M32956</guid>
      <dc:creator>standup1</dc:creator>
      <dc:date>2024-04-09T16:14:42Z</dc:date>
    </item>
  </channel>
</rss>

