DLT pipeline slow streaming (root cause needs to be identified)
03-16-2023 01:49 AM
Dear support,
We have a set of DLT pipelines that stream incoming data at a very low rate, and we need to find the root cause of the delay.
Below are details of the DLT pipeline setup and some metrics for the source table:
-The source table has 63,000,077,072 records
-The source table has 2 partitions that map directly to column values
-The source table has 4 partitions that are computed from column values
-The streaming query also filters the source table on one partitioned column and one un-partitioned column at the same time
-Metrics for the targeted partition:
partition,records
p1,2082775
p2,932645
p3,2808
p4,5
p5,2
p6,30990942
p7,80
p8,143623
p9,1735803700
p10,4819113815
p11,4749727822
p12,12491237547
p13,17198069143
p14,18333204664
p15,3638767501
-The partition of interest is p15, which holds 3,638,767,501 records
-After filtering on the partition column and the un-partitioned column, 76,929,237 records of interest need to be streamed
-The following options are used while streaming:
option("maxBytesPerTrigger", 1024 * 1024 * <MB_PER_TRIGGER_PROPERTY>)
option("ignoreChanges", "true")
option("startingTimestamp", <CUT_OFF_PROPERTY>)
MB_PER_TRIGGER_PROPERTY=10
CUT_OFF_PROPERTY=a given date
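
For reference, the streaming options listed above could be assembled as in the following minimal sketch. The helper names `build_stream_options` and `read_source`, the table name argument, and the example date are hypothetical; the thread does not show the actual pipeline code. The sketch assumes the source is a Delta table, since `ignoreChanges` and `startingTimestamp` are Delta streaming source options.

```python
def build_stream_options(mb_per_trigger: int, cut_off: str) -> dict:
    """Return the reader options from the post as a plain dict of strings."""
    return {
        # Soft cap on the amount of data admitted per micro-batch, in bytes.
        "maxBytesPerTrigger": str(1024 * 1024 * mb_per_trigger),
        "ignoreChanges": "true",
        "startingTimestamp": cut_off,
    }


def read_source(spark, table_name: str, mb_per_trigger: int, cut_off: str):
    """Open a streaming read on a table (needs an active SparkSession)."""
    options = build_stream_options(mb_per_trigger, cut_off)
    return spark.readStream.options(**options).table(table_name)


# MB_PER_TRIGGER_PROPERTY=10 comes from the post; the date is a placeholder.
print(build_stream_options(10, "2023-01-01"))
```

With MB_PER_TRIGGER_PROPERTY=10, the cap works out to 10,485,760 bytes (10 MiB) per micro-batch.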
-The DLT pipelines have the following specs in terms of processing power:
"node_type_id": "Standard_E8ds_v4",
"driver_node_type_id": "Standard_E8ds_v4",
"autoscale": {
    "min_workers": 1,
    "max_workers": 1,
    "mode": "LEGACY"
},
"photon": false
The problem observed is the following:
The rate at which data reaches the destination tables is very low. For instance, 2 million records reached the destination table in 50+ hours of streaming.
Note: 4 DLT pipelines stream concurrently from the same source table and append to different destination tables.
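
To put the figures above in perspective, here is a small back-of-the-envelope calculation using only numbers from this post. It assumes the 2 million records in 50 hours refer to records of interest delivered by a single pipeline, and it treats "50+" as exactly 50; both are assumptions, so the results are rough.

```python
# Record counts per partition value, copied from the post.
partition_records = {
    "p1": 2082775, "p2": 932645, "p3": 2808, "p4": 5, "p5": 2,
    "p6": 30990942, "p7": 80, "p8": 143623, "p9": 1735803700,
    "p10": 4819113815, "p11": 4749727822, "p12": 12491237547,
    "p13": 17198069143, "p14": 18333204664, "p15": 3638767501,
}
total_records = 63_000_077_072
records_of_interest = 76_929_237

# Share of the table held by p15, and share of p15 that passes the filters.
p15_share = partition_records["p15"] / total_records
selectivity = records_of_interest / partition_records["p15"]

# Observed delivery rate (assumption: 2 million records in exactly 50 hours).
observed_records = 2_000_000
observed_hours = 50
records_per_second = observed_records / (observed_hours * 3600)

# Time to deliver every record of interest at the observed rate.
hours_for_all = records_of_interest / records_per_second / 3600

print(f"p15 share of table:    {p15_share:.2%}")
print(f"filter selectivity:    {selectivity:.2%}")
print(f"observed rate:         {records_per_second:.1f} records/s")
print(f"time at observed rate: {hours_for_all:.0f} h ({hours_for_all / 24:.0f} days)")
```

Under these assumptions the observed rate is roughly 11 records per second, and streaming all 76,929,237 records of interest at that rate would take on the order of 1,900 hours.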
Best regards,
EDDatabricks
Labels: DLT, DLT Pipeline

03-24-2023 11:41 PM
@EDDatabricks:
Based on the provided information, there could be several factors contributing to the slow streaming speed:
- Data volume: The source table has over 63 billion records, and the partition of interest (p15) holds over 3.6 billion records. It's possible that the sheer volume of data being processed is slowing down the streaming.
- Processing power: The node type used for the DLT pipelines is Standard_E8ds_v4, which has 8 vCPUs and 64 GiB of memory, and the cluster is fixed at a single worker (min_workers and max_workers are both 1). This may not be enough processing power for the volume of data being streamed.
- Network bandwidth: The streaming data needs to be transmitted over the network to the destination tables. If the network bandwidth is limited, it could slow down the streaming speed.
- Data filtering: The streaming query is filtering on one partitioned and one un-partitioned column at the same time. Depending on the complexity of the filtering logic, it could slow down the streaming speed.
To improve the streaming performance, here are some suggestions:
- Increase the processing power of the DLT pipelines by using a more powerful node type, such as the Standard_E16ds_v4 or the Standard_E32ds_v4.
- Increase the maxBytesPerTrigger option to allow more data to be processed in each trigger. However, increasing this option too much could cause memory issues, so it's important to monitor the memory usage.
- Optimize the data filtering logic to make it more efficient. For example, consider partitioning the data differently or using a different column for filtering.
- Check the network bandwidth and consider increasing it if it's limiting the streaming speed.
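- Confirm that the tables involved are Delta tables, which can improve the performance of streaming and querying large datasets. The ignoreChanges and startingTimestamp options suggest the source already is one.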
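
As a rough illustration of the maxBytesPerTrigger suggestion, the sketch below estimates how many micro-batches a backlog would need at different caps. The function name `micro_batches_needed` and the 500 GiB backlog are hypothetical; the thread does not state the size of the source data in bytes. The option is a soft limit, so a real stream may admit more than the cap per batch and the true count can be lower.

```python
import math

MIB = 1024 * 1024


def micro_batches_needed(total_bytes: int, max_bytes_per_trigger: int) -> int:
    """Rough micro-batch count if each batch admits about max_bytes_per_trigger."""
    return math.ceil(total_bytes / max_bytes_per_trigger)


# Hypothetical backlog of 500 GiB; this figure is NOT from the thread.
backlog_bytes = 500 * 1024 * MIB

for mb_per_trigger in (10, 100, 1024):
    batches = micro_batches_needed(backlog_bytes, mb_per_trigger * MIB)
    print(f"{mb_per_trigger:>5} MiB per trigger: about {batches} micro-batches")
```

Each micro-batch carries fixed overhead (planning, commit, checkpointing), so a very small cap such as 10 MiB can leave the stream dominated by that overhead rather than by actual data processing.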

03-25-2023 03:45 AM
Hi @EDDatabricks,
Thank you for posting your question in our community! We are happy to assist you.
To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?
This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance!

