<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Reading a large zip file containing NDJson file in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126775#M47769</link>
    <description>&lt;P&gt;&lt;STRONG&gt;Unzip the Archive File&lt;/STRONG&gt;&lt;BR /&gt;Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the driver node's local storage.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Copy to the driver:&lt;/STRONG&gt; use Databricks Utilities (dbutils) to copy the ZIP file from ADLS to the ephemeral storage of the cluster's driver node.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Decompress to a distributed location:&lt;/STRONG&gt; use the %sh magic command to run unzip and, crucially, direct its output to a mounted ADLS container or a Unity Catalog Volume. This prevents the 115 GB of uncompressed files from filling the driver's limited local disk.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Ingest NDJSON Files with Auto Loader&lt;/STRONG&gt;&lt;BR /&gt;Once unzipped, you will have numerous 200 MB NDJSON files. Databricks Auto Loader is the ideal tool for ingesting them into a Delta table: it is more scalable and robust than reading files manually, because it tracks which files have already been ingested and handles schema variations automatically.&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Component&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Recommendation&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Rationale&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Driver Node&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;Standard_DS4_v2 (8 cores, 28 GB RAM) or similar&lt;/TD&gt;&lt;TD&gt;A reasonably powerful driver is needed to handle unzipping the 5 GB file.&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Worker Nodes&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Type&lt;/STRONG&gt;: Storage Optimized (e.g., Standard_L8s_v3) or General Purpose (e.g., Standard_DS4_v2)&lt;BR /&gt;&lt;STRONG&gt;Workers&lt;/STRONG&gt;: Min: 4, Max: 16&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Storage Optimized&lt;/STRONG&gt; instances are ideal for I/O-heavy ETL jobs. Start with a modest autoscaling range and adjust based on performance monitoring during the initial runs. The ~115 GB of data splits into roughly 920 partitions (at 128 MB each), which can be processed in parallel across the workers.&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
    <pubDate>Tue, 29 Jul 2025 08:53:09 GMT</pubDate>
    <dc:creator>chetan-mali</dc:creator>
    <dc:date>2025-07-29T08:53:09Z</dc:date>
    <item>
      <title>Reading a large zip file containing NDJson file in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126730#M47752</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We have a 5 GB ZIP file stored in ADLS. When uncompressed, it expands to approximately 115 GB and contains multiple NDJSON files, each around 200 MB in size. We need to read this data and write it to a Delta table in Databricks on a weekly basis.&lt;/P&gt;&lt;P&gt;What would be the most optimal approach and recommended cluster configuration to efficiently handle this workload?&lt;/P&gt;</description>
      <pubDate>Mon, 28 Jul 2025 16:27:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126730#M47752</guid>
      <dc:creator>surajtr</dc:creator>
      <dc:date>2025-07-28T16:27:10Z</dc:date>
    </item>
    <item>
      <title>Re: Reading a large zip file containing NDJson file in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126775#M47769</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Unzip the Archive File&lt;/STRONG&gt;&lt;BR /&gt;Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the driver node's local storage.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Copy to the driver:&lt;/STRONG&gt; use Databricks Utilities (dbutils) to copy the ZIP file from ADLS to the ephemeral storage of the cluster's driver node.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Decompress to a distributed location:&lt;/STRONG&gt; use the %sh magic command to run unzip and, crucially, direct its output to a mounted ADLS container or a Unity Catalog Volume. This prevents the 115 GB of uncompressed files from filling the driver's limited local disk.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Ingest NDJSON Files with Auto Loader&lt;/STRONG&gt;&lt;BR /&gt;Once unzipped, you will have numerous 200 MB NDJSON files. Databricks Auto Loader is the ideal tool for ingesting them into a Delta table: it is more scalable and robust than reading files manually, because it tracks which files have already been ingested and handles schema variations automatically.&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Component&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Recommendation&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Rationale&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Driver Node&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;Standard_DS4_v2 (8 cores, 28 GB RAM) or similar&lt;/TD&gt;&lt;TD&gt;A reasonably powerful driver is needed to handle unzipping the 5 GB file.&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Worker Nodes&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Type&lt;/STRONG&gt;: Storage Optimized (e.g., Standard_L8s_v3) or General Purpose (e.g., Standard_DS4_v2)&lt;BR /&gt;&lt;STRONG&gt;Workers&lt;/STRONG&gt;: Min: 4, Max: 16&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Storage Optimized&lt;/STRONG&gt; instances are ideal for I/O-heavy ETL jobs. Start with a modest autoscaling range and adjust based on performance monitoring during the initial runs. The ~115 GB of data splits into roughly 920 partitions (at 128 MB each), which can be processed in parallel across the workers.&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
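The copy-then-decompress step in the reply above can be sketched with Python's standard-library zipfile as an alternative to %sh unzip: it streams each archive member to the target directory in chunks, so no single 200 MB NDJSON file has to fit in driver memory. The ADLS and Volume paths in the comments are hypothetical placeholders, not values from this thread.

```python
# Minimal sketch: stream-extract a ZIP archive to a landing directory.
# The Databricks-specific calls (dbutils, Auto Loader) appear only as
# comments, since they run only inside a Databricks notebook.
import shutil
import zipfile
from pathlib import Path


def unzip_to_dir(zip_path: str, dest_dir: str) -> list:
    """Extract every file member of zip_path into dest_dir, streaming
    each member in 1 MB chunks, and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            if member.is_dir():
                continue
            target = dest / Path(member.filename).name
            # copyfileobj streams in chunks, so a large member never
            # needs to be held in memory all at once.
            with zf.open(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out, length=1024 * 1024)
            extracted.append(str(target))
    return extracted


# On Databricks the surrounding flow would be roughly (paths hypothetical):
#   dbutils.fs.cp("abfss://raw@myaccount.dfs.core.windows.net/weekly/data.zip",
#                 "file:/tmp/data.zip")
#   unzip_to_dir("/tmp/data.zip", "/Volumes/main/raw/ndjson_landing")
# followed by the Auto Loader ingest the reply describes:
#   (spark.readStream.format("cloudFiles")
#        .option("cloudFiles.format", "json")
#        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schema")
#        .load("/Volumes/main/raw/ndjson_landing"))
```

Writing to a Unity Catalog Volume path from the driver keeps the 115 GB of extracted files off the driver's local disk, which is the point of the second bullet in the reply.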
      <pubDate>Tue, 29 Jul 2025 08:53:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126775#M47769</guid>
      <dc:creator>chetan-mali</dc:creator>
      <dc:date>2025-07-29T08:53:09Z</dc:date>
    </item>
  </channel>
</rss>

