<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Reading a large zip file containing NDJson file in Databricks in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126775#M47769</link>
    <description>&lt;P&gt;&lt;STRONG&gt;Unzip the Archive File&lt;/STRONG&gt;&lt;BR /&gt;Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the driver node's local storage.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Copy to the driver:&lt;/STRONG&gt; use Databricks Utilities (dbutils) to copy the ZIP file from ADLS to the ephemeral storage of the cluster's driver node.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Decompress to a distributed location:&lt;/STRONG&gt; use the %sh magic command to run unzip and, crucially, direct its output to a mounted ADLS container or a Unity Catalog Volume. This prevents the 115 GB of uncompressed files from filling the driver's limited local disk.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Ingest NDJSON Files with Auto Loader&lt;/STRONG&gt;&lt;BR /&gt;Once unzipped, you will have numerous 200 MB NDJSON files. Databricks Auto Loader is the ideal tool for ingesting them into a Delta table: it is more scalable and robust than reading files manually, because it tracks which files have already been ingested and handles schema variations automatically.&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Component&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Recommendation&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Rationale&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Driver Node&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;Standard_DS4_v2 (8 cores, 28 GB RAM) or similar&lt;/TD&gt;&lt;TD&gt;A reasonably powerful driver is needed to handle unzipping the 5 GB file.&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Worker Nodes&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Type&lt;/STRONG&gt;: Storage Optimized (e.g., Standard_L8s_v3) or General Purpose (e.g., Standard_DS4_v2)&lt;BR /&gt;&lt;STRONG&gt;Workers&lt;/STRONG&gt;: Min: 4, Max: 16&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Storage Optimized&lt;/STRONG&gt; instances are ideal for I/O-heavy ETL jobs. Start with a modest autoscaling range and adjust based on performance monitoring during the initial runs. The ~115 GB of data splits into roughly 920 partitions (at 128 MB each), which can be processed in parallel across the workers.&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
    <pubDate>Tue, 29 Jul 2025 08:53:09 GMT</pubDate>
    <dc:creator>chetan-mali</dc:creator>
    <dc:date>2025-07-29T08:53:09Z</dc:date>
    <item>
      <title>Reading a large zip file containing NDJson file in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126730#M47752</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We have a 5 GB ZIP file stored in ADLS. When uncompressed, it expands to approximately 115 GB and contains multiple NDJSON files, each around 200 MB in size. We need to read this data and write it to a Delta table in Databricks on a weekly basis.&lt;/P&gt;&lt;P&gt;What would be the most optimal approach and recommended cluster configuration to efficiently handle this workload?&lt;/P&gt;</description>
      <pubDate>Mon, 28 Jul 2025 16:27:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126730#M47752</guid>
      <dc:creator>surajtr</dc:creator>
      <dc:date>2025-07-28T16:27:10Z</dc:date>
    </item>
    <item>
      <title>Re: Reading a large zip file containing NDJson file in Databricks</title>
      <link>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126775#M47769</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Unzip the Archive File&lt;/STRONG&gt;&lt;BR /&gt;Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the driver node's local storage.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;STRONG&gt;Copy to the driver:&lt;/STRONG&gt; use Databricks Utilities (dbutils) to copy the ZIP file from ADLS to the ephemeral storage of the cluster's driver node.&lt;/LI&gt;&lt;LI&gt;&lt;STRONG&gt;Decompress to a distributed location:&lt;/STRONG&gt; use the %sh magic command to run unzip and, crucially, direct its output to a mounted ADLS container or a Unity Catalog Volume. This prevents the 115 GB of uncompressed files from filling the driver's limited local disk.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;STRONG&gt;Ingest NDJSON Files with Auto Loader&lt;/STRONG&gt;&lt;BR /&gt;Once unzipped, you will have numerous 200 MB NDJSON files. Databricks Auto Loader is the ideal tool for ingesting them into a Delta table: it is more scalable and robust than reading files manually, because it tracks which files have already been ingested and handles schema variations automatically.&lt;/P&gt;&lt;TABLE&gt;&lt;TBODY&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Component&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Recommendation&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Rationale&lt;/STRONG&gt;&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Driver Node&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;Standard_DS4_v2 (8 cores, 28 GB RAM) or similar&lt;/TD&gt;&lt;TD&gt;A reasonably powerful driver is needed to handle unzipping the 5 GB file.&lt;/TD&gt;&lt;/TR&gt;&lt;TR&gt;&lt;TD&gt;&lt;STRONG&gt;Worker Nodes&lt;/STRONG&gt;&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Type&lt;/STRONG&gt;: Storage Optimized (e.g., Standard_L8s_v3) or General Purpose (e.g., Standard_DS4_v2)&lt;BR /&gt;&lt;STRONG&gt;Workers&lt;/STRONG&gt;: Min: 4, Max: 16&lt;/TD&gt;&lt;TD&gt;&lt;STRONG&gt;Storage Optimized&lt;/STRONG&gt; instances are ideal for I/O-heavy ETL jobs. Start with a modest autoscaling range and adjust based on performance monitoring during the initial runs. The ~115 GB of data splits into roughly 920 partitions (at 128 MB each), which can be processed in parallel across the workers.&lt;/TD&gt;&lt;/TR&gt;&lt;/TBODY&gt;&lt;/TABLE&gt;</description>
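The copy-then-decompress step in the reply above can be sketched with Python's standard-library zipfile as an alternative to %sh unzip: it streams each archive member to the target directory in chunks, so no single 200 MB NDJSON file has to fit in driver memory. The ADLS and Volume paths in the comments are hypothetical placeholders, not values from this thread.

```python
# Minimal sketch: stream-extract a ZIP archive to a landing directory.
# The Databricks-specific calls (dbutils, Auto Loader) appear only as
# comments, since they run only inside a Databricks notebook.
import shutil
import zipfile
from pathlib import Path


def unzip_to_dir(zip_path: str, dest_dir: str) -> list:
    """Extract every file member of zip_path into dest_dir, streaming
    each member in 1 MB chunks, and return the extracted file paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as zf:
        for member in zf.infolist():
            if member.is_dir():
                continue
            target = dest / Path(member.filename).name
            # copyfileobj streams in chunks, so a large member never
            # needs to be held in memory all at once.
            with zf.open(member) as src, open(target, "wb") as out:
                shutil.copyfileobj(src, out, length=1024 * 1024)
            extracted.append(str(target))
    return extracted


# On Databricks the surrounding flow would be roughly (paths hypothetical):
#   dbutils.fs.cp("abfss://raw@myaccount.dfs.core.windows.net/weekly/data.zip",
#                 "file:/tmp/data.zip")
#   unzip_to_dir("/tmp/data.zip", "/Volumes/main/raw/ndjson_landing")
# followed by the Auto Loader ingest the reply describes:
#   (spark.readStream.format("cloudFiles")
#        .option("cloudFiles.format", "json")
#        .option("cloudFiles.schemaLocation", "/Volumes/main/raw/_schema")
#        .load("/Volumes/main/raw/ndjson_landing"))
```

Writing to a Unity Catalog Volume path from the driver keeps the 115 GB of extracted files off the driver's local disk, which is the point of the second bullet in the reply.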
      <pubDate>Tue, 29 Jul 2025 08:53:09 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/reading-a-large-zip-file-containing-ndjson-file-in-databricks/m-p/126775#M47769</guid>
      <dc:creator>chetan-mali</dc:creator>
      <dc:date>2025-07-29T08:53:09Z</dc:date>
    </item>
  </channel>
</rss>

