Unzip the Archive File
Apache Spark cannot directly read compressed ZIP archives, so the first step is to decompress the 5 GB file. Since the uncompressed size is substantial (115 GB), the process must be handled carefully to avoid overwhelming the driver node's local storage.
- Copy to the driver: use Databricks Utilities (dbutils) to copy the ZIP file from ADLS to the ephemeral storage of the cluster's driver node.
- Decompress to a distributed location: use the %sh magic command to run unzip, and crucially, direct its output to a separate location on your mounted ADLS container or a Unity Catalog Volume. This prevents the 115 GB of uncompressed files from filling up the driver's limited local disk. A sketch of both steps follows this list.
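A minimal sketch of both steps, assuming a Databricks notebook (where dbutils is predefined); the storage account, container, and volume paths are hypothetical placeholders, and the unzip call shown via subprocess is equivalent to running the same command in a %sh cell as described above.

```python
import subprocess

# Step 1: copy the 5 GB ZIP from ADLS to the driver's local ephemeral disk.
dbutils.fs.cp(
    "abfss://raw@examplestorage.dfs.core.windows.net/archive/export.zip",  # hypothetical source path
    "file:/tmp/export.zip",
)

# Step 2: unzip directly into a Unity Catalog Volume (or a /dbfs/mnt/... path),
# so the ~115 GB of uncompressed output never lands on the driver's local disk.
subprocess.run(
    ["unzip", "-o", "/tmp/export.zip", "-d", "/Volumes/main/raw/landing/unzipped/"],
    check=True,
)
```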
Ingest NDJSON Files with Auto Loader
Once unzipped, you will have numerous ~200 MB NDJSON files. Databricks Auto Loader is the ideal tool for ingesting them into a Delta table: it is more scalable and robust than reading the files manually, since it tracks which files have already been ingested and handles schema variations automatically.
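A minimal Auto Loader sketch under the same assumptions (hypothetical volume paths and target table name; the availableNow trigger processes everything currently present and then stops, which suits a one-off backfill):

```python
# Assumes a Databricks notebook where `spark` is predefined.
src_path = "/Volumes/main/raw/landing/unzipped/"                      # the unzipped NDJSON files
schema_path = "/Volumes/main/raw/landing/_autoloader/schema"          # Auto Loader schema tracking
checkpoint_path = "/Volumes/main/raw/landing/_autoloader/checkpoint"  # streaming checkpoint / file-tracking state

(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")               # NDJSON: one JSON record per line
    .option("cloudFiles.schemaLocation", schema_path)
    .load(src_path)
    .writeStream
    .option("checkpointLocation", checkpoint_path)
    .trigger(availableNow=True)                        # process all files present, then stop
    .toTable("main.raw.events")                        # hypothetical target Delta table
)
```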
| Cluster Component | Recommended Configuration | Rationale |
| --- | --- | --- |
| Driver Node | Standard_DS4_v2 (8 cores, 28 GB RAM) or similar | A reasonably powerful driver is needed to handle unzipping the 5 GB file. |
| Worker Nodes | Type: Storage Optimized (e.g., Standard_L8s_v3) or General Purpose (e.g., Standard_DS4_v2); autoscaling: min 4, max 16 workers | Storage Optimized instances are ideal for I/O-heavy ETL jobs. Start with a modest autoscaling range and adjust based on performance monitoring during the initial runs. The ~115 GB of data splits into roughly 920 partitions (at 128 MB each), which can be processed in parallel across the workers. |
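For illustration, this sizing could be expressed as a cluster spec for the Databricks Clusters API or a job cluster definition; the cluster name, runtime version, and exact node types below are assumptions to adapt to your workspace.

```python
# Hypothetical cluster spec mirroring the table above.
cluster_spec = {
    "cluster_name": "ndjson-ingest",               # assumed name
    "spark_version": "14.3.x-scala2.12",           # assumed LTS runtime; use one available in your workspace
    "driver_node_type_id": "Standard_DS4_v2",      # driver sized to handle the 5 GB unzip
    "node_type_id": "Standard_L8s_v3",             # storage-optimized workers for I/O-heavy ETL
    "autoscale": {"min_workers": 4, "max_workers": 16},
}
```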