Optimizing Data Insertion Speed for JSON Files in DLT Pipeline

MiBjorn · 05-29-2024

Background:

I'm working on a data pipeline to insert JSON files as quickly as possible. Here are the details of my setup:

File Size: 1.5 - 2 kB each

File Volume: Approximately 30,000 files per hour

Pipeline: Using Databricks Delta Live Tables (DLT) in continuous pipeline mode

Cluster Configuration:

Type: Standard_E8ds_v4 (both worker and driver)

Enhanced Autoscaling: Enabled with 1-5 workers

Storage: Files are received continuously in an Azure Data Lake Storage (ADLS) Gen2 container

Pre-processing: JSON files are flattened before being inserted into the bronze layer

Issue:

Despite the continuous reception of files, the records appear to be inserted into the bronze layer in batches every 20-30 minutes. This delay is causing issues with the timeliness of data processing.

Observations:

I reviewed the metrics for the DLT pipeline cluster and observed the following:

CPU Utilization: Does not exceed 10%.
Memory Utilization: Approximately 40-45 GB out of 64 GB.
Network Traffic: The "transmitted through network" graph shows spikes every 20-30 minutes, matching the times when data is inserted into the bronze layer.
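
For scale, here is a rough back-of-the-envelope calculation of the ingest rate implied by the figures above. It is only a sketch: it assumes files arrive evenly over the hour and uses decimal units (1 MB = 1000 kB). The variable names are my own and not part of the pipeline.

```python
# Figures taken from the setup described in this post.
FILES_PER_HOUR = 30_000
FILE_SIZE_KB = (1.5, 2.0)        # smallest and largest file size, in kB
BATCH_GAP_MINUTES = (20, 30)     # observed gap between bronze inserts

# Assumption: files arrive evenly across the hour.
files_per_second = FILES_PER_HOUR / 3600

# Raw JSON volume at the low and high end of the file size range.
kb_per_second = tuple(files_per_second * size for size in FILE_SIZE_KB)
mb_per_hour = tuple(FILES_PER_HOUR * size / 1000 for size in FILE_SIZE_KB)

# Number of files that accumulate between two observed inserts.
files_per_batch = tuple(FILES_PER_HOUR * gap // 60 for gap in BATCH_GAP_MINUTES)

print(f"files per second: {files_per_second:.1f}")
print(f"kB per second:    {kb_per_second[0]:.1f} to {kb_per_second[1]:.1f}")
print(f"MB per hour:      {mb_per_hour[0]:.0f} to {mb_per_hour[1]:.0f}")
print(f"files per batch:  {files_per_batch[0]} to {files_per_batch[1]}")
```

That works out to roughly 8 files per second and 45-60 MB of raw JSON per hour, with 10,000 to 15,000 files accumulating between inserts. The data volume itself is small, which fits the low CPU utilization noted above.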

Does anyone have an idea why this batching behavior is occurring despite the continuous mode setting? What should I check or adjust to diagnose and resolve this issue to ensure more immediate insertion of data?

If any other specifications or information are needed, please let me know.
