
Reading a large ZIP file containing NDJSON files in Databricks

surajtr
New Contributor

Hi,

We have a 5 GB ZIP file stored in ADLS. When uncompressed, it expands to approximately 115 GB and contains multiple NDJSON files, each around 200 MB in size. We need to read this data and write it to a Delta table in Databricks on a weekly basis.

What would be the most optimal approach and recommended cluster configuration to efficiently handle this workload?

1 REPLY

chetan-mali
New Contributor III

Unzip the Archive File
Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the driver node's local storage.

  • Copy to the driver: use Databricks Utilities (dbutils) to copy the ZIP file from ADLS to the ephemeral local storage of the cluster's driver node.
  • Decompress to a distributed location: use the %sh magic command to run unzip, directing its output to a mounted ADLS container or a Unity Catalog Volume rather than the driver's disk. This prevents the 115 GB of uncompressed files from filling up the driver's limited local storage. (See the sketch after this list.)
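A minimal sketch of that two-step flow. The abfss URL, Volume path, and file names below are hypothetical placeholders; adjust them to your own storage account and catalog.

```python
# Step 1 (Python cell): copy the 5 GB ZIP from ADLS to the driver's local disk.
# The abfss URL and target path are placeholders for this example.
dbutils.fs.cp(
    "abfss://landing@mystorageaccount.dfs.core.windows.net/exports/weekly_export.zip",
    "file:/tmp/weekly_export.zip",
)

# Step 2 (separate %sh cell): unzip straight into a Unity Catalog Volume
# (or a mounted ADLS path) so the ~115 GB of output never lands on the
# driver's local disk:
#
#   %sh
#   unzip -o /tmp/weekly_export.zip -d /Volumes/main/raw/weekly_ndjson/
```

The driver only needs enough free local disk for the ~5 GB archive itself, since the extracted files are written directly to the Volume.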

Ingest NDJSON Files with Auto Loader
Once unzipped, you will have numerous 200 MB NDJSON files. Databricks Auto Loader is the ideal tool for ingesting these files into a Delta table. It is more scalable and robust than manually reading files, as it can track ingested files and handle schema variations automatically.
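A minimal Auto Loader sketch, reusing the hypothetical Volume path from the previous step; the catalog, schema, and table names are placeholders.

```python
# Incrementally ingest the unzipped NDJSON files into a Delta table.
input_path = "/Volumes/main/raw/weekly_ndjson/"
checkpoint = "/Volumes/main/raw/_checkpoints/weekly_ndjson"

(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")              # NDJSON: one JSON object per line
    .option("cloudFiles.schemaLocation", checkpoint)  # Auto Loader infers and tracks the schema here
    .load(input_path)
    .writeStream
    .option("checkpointLocation", checkpoint)         # records which files were already ingested
    .trigger(availableNow=True)                       # process all pending files, then stop
    .toTable("main.raw.weekly_events"))
```

With the availableNow trigger, the stream processes everything that has arrived since the last run and then stops, which fits the weekly batch cadence well.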

 

Recommended Cluster Configuration

  • Driver node: Standard_DS4_v2 (8 cores, 28 GB RAM) or similar. A reasonably powerful driver is needed to handle unzipping the 5 GB file.
  • Worker nodes: Storage Optimized (e.g., Standard_L8s_v3) or General Purpose (e.g., Standard_DS4_v2), autoscaling between 4 and 16 workers. Storage Optimized instances are ideal for I/O-heavy ETL jobs; start with a modest autoscaling range and adjust based on performance monitoring during the initial runs. The ~115 GB of data will be split into roughly 920 partitions (at 128 MB each), which can be processed in parallel across the workers.
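If you want to codify that sizing, here is a rough sketch using the Databricks SDK for Python (databricks-sdk). The cluster name, runtime version, and autotermination setting are assumptions; the node types mirror the recommendations above.

```python
# Hypothetical cluster definition matching the sizing above.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import compute

w = WorkspaceClient()

cluster = w.clusters.create(
    cluster_name="weekly-ndjson-ingest",            # assumed name
    spark_version="14.3.x-scala2.12",               # any current LTS runtime works
    driver_node_type_id="Standard_DS4_v2",          # 8 cores / 28 GB for the unzip step
    node_type_id="Standard_L8s_v3",                 # storage-optimized workers for I/O-heavy ETL
    autoscale=compute.AutoScale(min_workers=4, max_workers=16),
    autotermination_minutes=30,                     # assumed; avoids idle cost between weekly runs
).result()
```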