Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Optimizing Data Insertion Speed for JSON Files in DLT Pipeline

MiBjorn
New Contributor II

Background:

I'm working on a data pipeline to insert JSON files as quickly as possible. Here are the details of my setup:

 

File Size: 1.5 - 2 kB each

File Volume: Approximately 30,000 files per hour

Pipeline: Using Databricks Delta Live Tables (DLT) in continuous pipeline mode

Cluster Configuration:

     Type: Standard_E8ds_v4 (both worker and driver)

     Enhanced Autoscaling: Enabled with 1-5 workers

Storage: Files are received continuously in an Azure Data Lake Storage (ADLS) Gen2 container

Pre-processing: JSON files are flattened before being inserted into the bronze layer
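The post does not include the flattening code, but the pre-processing step described above might look something like the following minimal sketch. The field names in the sample record are hypothetical; the actual schema is not shown in the thread.

```python
import json

def flatten(record, parent_key="", sep="_"):
    """Recursively flatten a nested JSON object into a single-level dict,
    joining nested keys with `sep` (e.g. {"a": {"b": 1}} -> {"a_b": 1})."""
    items = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            items.update(flatten(value, new_key, sep=sep))
        else:
            items[new_key] = value
    return items

# Hypothetical 1.5-2 kB telemetry record, reduced for illustration
raw = json.loads('{"device": {"id": "d1", "status": {"ok": true}}, "ts": 1700000000}')
print(flatten(raw))  # {'device_id': 'd1', 'device_status_ok': True, 'ts': 1700000000}
```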

 

Issue:

Despite the continuous reception of files, the records appear to be inserted into the bronze layer in batches every 20-30 minutes. This delay is causing issues with the timeliness of data processing.

 

Observations:

I reviewed the metrics for the DLT pipeline cluster and observed the following:

  • CPU Utilization: Does not exceed 10%.
  • Memory Utilization: Approximately 40-45 GB out of 64 GB.
  • Network Traffic: The "transmitted through network" graph shows spikes every 20-30 minutes, which coincide with the times data is inserted into the bronze layer.

 

Does anyone have an idea why this batching behavior is occurring despite the continuous mode setting? What should I check or adjust to diagnose and resolve this issue to ensure more immediate insertion of data?
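Batching at a fixed 20-30 minute cadence, with CPU nearly idle, usually points to the micro-batch trigger cadence rather than cluster resources. One thing to check is the `pipelines.trigger.interval` setting, which controls how often a continuous DLT pipeline triggers a micro-batch. A hedged sketch, assuming the bronze table is fed by Auto Loader (the table name, path, and option values below are assumptions, not taken from the post):

```python
import dlt  # available only inside a DLT pipeline runtime

@dlt.table(
    name="bronze_events",  # hypothetical table name
    spark_conf={
        # Trigger a micro-batch every 10 seconds instead of the default cadence
        "pipelines.trigger.interval": "10 seconds"
    },
)
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Cap files per micro-batch so many small files don't pile into one batch
        .option("cloudFiles.maxFilesPerTrigger", 1000)
        .load("abfss://<container>@<account>.dfs.core.windows.net/events/")  # placeholder path
    )
```

This is a sketch of where the knob lives, not a confirmed fix; the file-notification vs. directory-listing mode of Auto Loader can also affect how quickly new files are discovered.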

If any other specifications or information are needed, please let me know.

2 REPLIES

Kaniz
Community Manager

Hi @MiBjorn,

Feel free to provide additional information or ask follow-up questions if needed! 😊

 

MiBjorn
New Contributor II

@Kaniz The DLT pipeline uses the Core product edition, as no features used in the code require the Pro/Advanced edition. The pipeline runs without errors; the issue, as mentioned above, is that it can take up to 30 minutes before a row is inserted even though the pipeline is set to continuous mode, which I find a bit strange.
