Background:
I'm working on a data pipeline to insert JSON files as quickly as possible. Here are the details of my setup:
File Size: 1.5 - 2 kB each
File Volume: Approximately 30,000 files per hour
Pipeline: Using Databricks Delta Live Tables (DLT) in continuous pipeline mode
Cluster Configuration:
Type: Standard_E8ds_v4 (both worker and driver)
Enhanced Autoscaling: Enabled with 1-5 workers
Storage: Files are received continuously in an Azure Data Lake Storage (ADLS) Gen2 container
Pre-processing: JSON files are flattened before being inserted into the bronze layer
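For context, the flattening step is conceptually like the sketch below (the function and field names are placeholders for illustration, not my actual job code; the real pipeline applies the same idea per file before writing to bronze):

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten a nested JSON object into dot-separated keys."""
    flat = {}
    for key, value in obj.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the accumulated key path
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

record = json.loads('{"device": {"id": 7, "meta": {"fw": "1.2"}}, "temp": 21.5}')
print(flatten(record))
# {'device.id': 7, 'device.meta.fw': '1.2', 'temp': 21.5}
```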
Issue:
Despite files arriving continuously, records appear in the bronze layer only in batches every 20-30 minutes. This latency is causing problems with the timeliness of downstream processing.
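For completeness, the relevant part of the pipeline settings JSON looks roughly like this (trimmed to the fields I believe matter here; values reflect the configuration described above):

```json
{
  "continuous": true,
  "clusters": [
    {
      "label": "default",
      "node_type_id": "Standard_E8ds_v4",
      "driver_node_type_id": "Standard_E8ds_v4",
      "autoscale": { "min_workers": 1, "max_workers": 5, "mode": "ENHANCED" }
    }
  ]
}
```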
Observations:
I reviewed the metrics for the DLT pipeline cluster and observed the following:
- CPU Utilization: Does not exceed 10%.
- Memory Utilization: Approximately 40-45 GB out of 64 GB.
- Network Traffic: The "transmitted through network" graph shows spikes every 20-30 minutes, which coincide with the times data lands in the bronze layer.
Does anyone have an idea why this batching behavior occurs even though the pipeline runs in continuous mode? What should I check or adjust to diagnose the issue and get records inserted with lower latency?
If any other specifications or information are needed, please let me know.