Background:
I'm working on a data pipeline to insert JSON files as quickly as possible. Here are the details of my setup:
File Size: 1.5 - 2 kB each
File Volume: Approximately 30,000 files per hour
Pipeline: Using Databricks Delta Live Tables (DLT) in continuous pipeline mode
Cluster Configuration:
Type: Standard_E8ds_v4 (both worker and driver)
Enhanced Autoscaling: Enabled with 1-5 workers
Storage: Files are received continuously in an Azure Data Lake Storage (ADLS) Gen2 container
Pre-processing: JSON files are flattened before being inserted into the bronze layer
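For context, the flattening step is conceptually like the sketch below (the function and field names are placeholders for illustration, not my actual job code; the real pipeline applies the same idea per file before writing to bronze):

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten a nested JSON object into dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the accumulated key path
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

record = json.loads('{"device": {"id": 7, "meta": {"fw": "1.2"}}, "temp": 21.5}')
print(flatten(record))
# {'device.id': 7, 'device.meta.fw': '1.2', 'temp': 21.5}
```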
Issue:
Despite files arriving continuously, records appear in the bronze layer only in batches every 20-30 minutes. This latency is causing problems with the timeliness of downstream processing.
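For completeness, the relevant part of the pipeline settings JSON looks roughly like this (trimmed to the fields I believe matter here; values reflect the configuration described above):

```json
{
  "continuous": true,
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_E8ds_v4",
      "driver_node_type_id": "Standard_E8ds_v4",
      "autoscale": { "min_workers": 1, "max_workers": 5, "mode": "ENHANCED" }
    }
  ]
}
```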
Observations:
I reviewed the metrics for the DLT pipeline cluster and observed the following:
- CPU Utilization: Does not exceed 10%.
- Memory Utilization: Approximately 40-45 GB out of 64 GB.
- Network Traffic: The "transmitted through network" graph shows spikes every 20-30 minutes, which coincide with the times data lands in the bronze layer.
Does anyone have an idea why this batching behavior occurs even though the pipeline runs in continuous mode? What should I check or adjust to diagnose the issue and get records inserted with lower latency?
If any other specifications or information are needed, please let me know.