DLT pipeline slow streaming (root cause needs to be identified)

EDDatabricks
Contributor

Dear support,

We have a set of DLT pipelines that are streaming incoming data at a very low rate, and we need to find the root cause of this delay.

To give more insight, here are the setup of the DLT pipelines and some metrics for the source table:

-The source table has 63.000.077.072 records

-The source table has 2 partitions which are mapped to column values directly

-The source table has 4 partitions which are calculated values from column values

-The streaming query filters the source table on one partition column and one non-partition column at the same time

-Metrics of the targeted partitions:

partition,records
p1,2082775
p2,932645
p3,2808
p4,5
p5,2
p6,30990942
p7,80
p8,143623
p9,1735803700
p10,4819113815
p11,4749727822
p12,12491237547
p13,17198069143
p14,18333204664
p15,3638767501

-The partition of interest is p15 and holds 3.638.767.501 records

-After filtering on the partition column and the non-partition column, 76.929.237 records of interest remain to be streamed

-The following options are used while streaming:

option("maxBytesPerTrigger", 1024 * 1024 * <MB_PER_TRIGGER_PROPERTY>)
option("ignoreChanges", "true")
option("startingTimestamp", <CUT_OFF_PROPERTY>)

MB_PER_TRIGGER_PROPERTY=10
CUT_OFF_PROPERTY=a given date

-The DLT pipelines have the following specs in terms of processing power:

"node_type_id": "Standard_E8ds_v4",
"driver_node_type_id": "Standard_E8ds_v4",
"autoscale": {
   "min_workers": 1,
   "max_workers": 1,
   "mode": "LEGACY"
},
"photon": false

The problem observed is the following:

Data is written to the destination tables very slowly. For instance, 2 million records have reached the destination table in 50+ hours of streaming.
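For scale, here is a rough back-of-the-envelope calculation from the figures above. It treats "2 million records in 50+ hours" as exactly 2,000,000 records in 50 hours, which is an approximation, and it assumes the rate stays constant.

```python
# Figures quoted in the post; the observed rate is rounded and approximate.
records_delivered = 2_000_000      # records that reached the destination
hours_elapsed = 50                 # "50+ hours", taken as 50 for simplicity
records_of_interest = 76_929_237   # records expected after filtering

# Observed throughput in records per second.
records_per_second = records_delivered / (hours_elapsed * 3600)

# Time needed to deliver all records of interest at the same rate.
hours_to_finish = records_of_interest / (records_delivered / hours_elapsed)

print(f"{records_per_second:.1f} records/s")
print(f"{hours_to_finish / 24:.0f} days to stream all records of interest")
```

At roughly 11 records per second, streaming all 76.929.237 records would take on the order of 80 days.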

Note: There are 4 DLT pipelines streaming concurrently from the same source table and appending to different destination tables.

Best regards,

EDDatabricks


Anonymous
Not applicable

@EDDatabricks:

Based on the provided information, there could be several factors contributing to the slow streaming speed:

  1. Data volume: The source table has over 63 billion records, and the partition of interest (p15) holds over 3.6 billion records. It's possible that the sheer volume of data being processed is slowing down the streaming.
  2. Processing power: The node type used for the DLT pipelines is Standard_E8ds_v4, which has 8 vCPUs, 64 GiB memory, and 2,400 MB/s disk throughput. It's possible that the processing power is not sufficient to handle the volume of data being streamed.
  3. Network bandwidth: The streaming data needs to be transmitted over the network to the destination tables. If the network bandwidth is limited, it could slow down the streaming speed.
  4. Data filtering: The streaming query is filtering on one partitioned and one un-partitioned column at the same time. Depending on the complexity of the filtering logic, it could slow down the streaming speed.
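To put the data volume and filtering points in proportion, here is a small sketch that uses only the record counts from the question. It does not say how the streaming source actually prunes or reads data; that depends on the runtime and is not something the thread confirms.

```python
# Per-partition record counts as posted in the question.
partition_counts = {
    "p1": 2_082_775, "p2": 932_645, "p3": 2_808, "p4": 5, "p5": 2,
    "p6": 30_990_942, "p7": 80, "p8": 143_623, "p9": 1_735_803_700,
    "p10": 4_819_113_815, "p11": 4_749_727_822, "p12": 12_491_237_547,
    "p13": 17_198_069_143, "p14": 18_333_204_664, "p15": 3_638_767_501,
}
records_of_interest = 76_929_237  # rows left after both filters

total_records = sum(partition_counts.values())
p15_records = partition_counts["p15"]

# Share of the whole table that lives in the partition of interest.
share_of_table = p15_records / total_records
# Share of that partition that survives the non-partition column filter.
share_of_partition = records_of_interest / p15_records

print(f"total records: {total_records:,}")
print(f"p15 is {share_of_table:.1%} of the table")
print(f"{share_of_partition:.1%} of p15 matches the second filter")
```

The partition of interest is under 6% of the table, and only about 2% of its rows match the second filter. If the per-trigger byte limit is applied to the data read before filtering (an assumption worth verifying), most of each small batch would be discarded.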

To improve the streaming performance, here are some suggestions:

  1. Increase the processing power of the DLT pipelines by using a more powerful node type, such as the Standard_E16ds_v4 or the Standard_E32ds_v4.
  2. Increase the maxBytesPerTrigger option, which is currently set to only 10 MB per trigger, to allow more data to be processed in each trigger. However, increasing this option too much could cause memory issues, so it's important to monitor the memory usage.
  3. Optimize the data filtering logic to make it more efficient. For example, consider partitioning the data differently or using a different column for filtering.
  4. Check the network bandwidth and consider increasing it if it's limiting the streaming speed.
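  5. Confirm that the source and destination are Delta tables (the ignoreChanges and startingTimestamp options suggest the source already is one), since Delta tables can improve the performance of streaming and querying large datasets.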
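As a minimal sketch of suggestion 2, the reader options could be built in one place so the per-trigger limit is easy to change. The helper name build_stream_options, the 1024 MB value, and the cut-off date "2023-01-01" are hypothetical placeholders and are not taken from the original post.

```python
from typing import Dict


def build_stream_options(mb_per_trigger: int, cutoff: str) -> Dict[str, str]:
    """Return the streaming reader options used in the question as a dict.

    mb_per_trigger: soft limit per micro-batch, in MiB.
    cutoff: starting timestamp (placeholder value in this sketch).
    """
    return {
        "maxBytesPerTrigger": str(1024 * 1024 * mb_per_trigger),
        "ignoreChanges": "true",
        "startingTimestamp": cutoff,
    }


# Current setting from the question: 10 MiB per trigger.
current = build_stream_options(10, "2023-01-01")
# Hypothetical larger setting to try: 1024 MiB per trigger.
larger = build_stream_options(1024, "2023-01-01")

print(current["maxBytesPerTrigger"])
print(larger["maxBytesPerTrigger"])

# Assumed usage inside the pipeline (needs a Spark session, not run here):
# spark.readStream.format("delta").options(**larger).table("<source_table>")
```

Raising the limit in steps while watching batch duration and memory usage is safer than jumping straight to a very large value.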

Anonymous
Not applicable

Hi @EDDatabricks

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 
