
Optimizing Data Insertion Speed for JSON Files in DLT Pipeline

MiBjorn
New Contributor II

Background:

I'm working on a data pipeline to insert JSON files as quickly as possible. Here are the details of my setup:

 

File Size: 1.5 - 2 kB each

File Volume: Approximately 30,000 files per hour

Pipeline: Using Databricks Delta Live Tables (DLT) in continuous pipeline mode

Cluster Configuration:

     Type: Standard_E8ds_v4 (both worker and driver)

     Enhanced Autoscaling: Enabled with 1-5 workers

Storage: Files are received continuously in an Azure Data Lake Storage (ADLS) Gen2 container

Pre-processing: JSON files are flattened before being inserted into the bronze layer

 

Issue:

Despite the continuous reception of files, the records appear to be inserted into the bronze layer in batches every 20-30 minutes. This delay is causing issues with the timeliness of data processing.
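To put the numbers above in perspective, here is a rough back-of-the-envelope calculation using only the figures from this post (30,000 files per hour, 1.5-2 kB each, batches every 20-30 minutes). The variable names are just for illustration, and 1 MB is taken as 1000 kB.

```python
# Workload figures taken from the post; names are illustrative only.
FILES_PER_HOUR = 30_000
FILE_SIZE_KB = (1.5, 2.0)        # smallest and largest file size
BATCH_INTERVAL_MIN = (20, 30)    # observed gap between bronze inserts

# Average arrival rate of files.
files_per_second = FILES_PER_HOUR / 3600

# Number of files that pile up between two observed inserts.
files_per_batch = tuple(FILES_PER_HOUR * m // 60 for m in BATCH_INTERVAL_MIN)

# Raw data volume per hour, in MB (1 MB = 1000 kB).
mb_per_hour = tuple(FILES_PER_HOUR * kb / 1000 for kb in FILE_SIZE_KB)

print(f"{files_per_second:.2f} files/s")
print(f"{files_per_batch[0]}-{files_per_batch[1]} files per batch")
print(f"{mb_per_hour[0]:.0f}-{mb_per_hour[1]:.0f} MB per hour")
```

So each observed insert covers roughly 10,000-15,000 small files, while the total volume is only about 45-60 MB per hour. That suggests the sheer number of files, rather than the amount of data, is the more likely thing to look at.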

 

Observations:

I reviewed the metrics for the DLT pipeline cluster and observed the following:

  • CPU Utilization: Does not exceed 10%.
  • Memory Utilization: Approximately 40-45 GB out of 64 GB.
  • Network Traffic: The "transmitted through network" graph shows spikes every 20-30 min, which match the time data is inserted into the bronze layer.

 

Does anyone have an idea why this batching behavior is occurring despite the continuous mode setting? What should I check or adjust to diagnose and resolve this issue to ensure more immediate insertion of data?

If any other specifications or information are needed, please let me know.
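For reference, these are the settings that could be worth checking first. This is only a sketch under assumptions: the post does not say whether Auto Loader (`cloudFiles`) is used or how the table is defined, so the table name `bronze_events`, the `SOURCE_PATH` placeholder, and the chosen values below are hypothetical. `pipelines.trigger.interval` and `cloudFiles.useNotifications` are existing Databricks settings, but whether they explain the 20-30 minute gap here is untested.

```python
# Hypothetical Auto Loader options for the bronze ingest. The post does not
# state which reader or options are actually in use.
AUTOLOADER_OPTIONS = {
    "cloudFiles.format": "json",
    # File notification mode discovers new files from storage events rather
    # than by listing the directory, which can matter with many small files.
    "cloudFiles.useNotifications": "true",
}

# Hypothetical per-table Spark configuration: how often the streaming flow
# is triggered. The value is an example, not a recommendation.
TABLE_SPARK_CONF = {
    "pipelines.trigger.interval": "10 seconds",
}

# Inside a DLT notebook (not runnable outside Databricks) these could be
# applied roughly as follows, with SOURCE_PATH standing in for the ADLS path:
#
#   import dlt
#
#   @dlt.table(name="bronze_events", spark_conf=TABLE_SPARK_CONF)
#   def bronze_events():
#       return (
#           spark.readStream.format("cloudFiles")
#           .options(**AUTOLOADER_OPTIONS)
#           .load(SOURCE_PATH)
#       )

for key, value in {**AUTOLOADER_OPTIONS, **TABLE_SPARK_CONF}.items():
    print(f"{key} = {value}")
```

Comparing these values with what the pipeline currently uses would at least show whether file discovery or the trigger interval is a plausible source of the delay.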

1 REPLY

MiBjorn
New Contributor II

@Retired_mod The DLT pipeline uses the Core product edition, as the code does not use any features that require the Pro or Advanced edition. The pipeline runs without errors. The issue, as mentioned above, is that it can take up to 30 minutes before a row is inserted when the pipeline is set to continuous mode, which I find a bit strange.
