Re: DLT pipeline slow streaming (root cause needs ...

Anonymous · ‎03-24-2023

@EDDatabricks:

Based on the provided information, there could be several factors contributing to the slow streaming speed:

Data volume: The source table has over 63 billion records, and the partition of interest (p15) holds over 3.6 billion records. It's possible that the sheer volume of data being processed is slowing down the streaming.
Processing power: The node type used for the DLT pipelines is Standard_E8ds_v4, which has 8 vCPUs, 64 GiB memory, and 2,400 MB/s disk throughput. It's possible that the processing power is not sufficient to handle the volume of data being streamed.
Network bandwidth: The streaming data needs to be transmitted over the network to the destination tables. If the network bandwidth is limited, it could slow down the streaming speed.
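Data filtering: The streaming query filters on one partitioned and one un-partitioned column at the same time. Only the filter on the partitioned column can narrow the read down to a partition; the filter on the un-partitioned column may still require reading and evaluating every record in that partition, which could slow down the streaming speed.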
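To put the data volume in perspective, here is a rough back-of-the-envelope estimate of how long it would take to work through the p15 partition at a steady rate. The record count comes from the figures above; the throughput rates are hypothetical placeholders, not measurements from this pipeline, so substitute the rate you actually observe.

```python
def hours_to_drain(record_count, records_per_second):
    """Return the hours needed to process record_count at a steady rate."""
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return record_count / records_per_second / 3600


# "Over 3.6 billion" records in partition p15, rounded down for the estimate.
PARTITION_RECORDS = 3_600_000_000

# Hypothetical sustained rates (records per second), for illustration only.
for rate in (10_000, 100_000, 1_000_000):
    hours = hours_to_drain(PARTITION_RECORDS, rate)
    print(f"{rate:>9,} rec/s -> {hours:.1f} h")
```

Comparing the observed rate against an estimate like this can help tell whether the stream is merely working through a large backlog or is genuinely slower than the hardware should allow.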

To improve the streaming performance, here are some suggestions:

Increase the processing power of the DLT pipelines by using a more powerful node type, such as the Standard_E16ds_v4 or the Standard_E32ds_v4.
Increase the maxBytesPerTrigger option to allow more data to be processed in each micro-batch. Larger batches need more memory, so raise the value gradually and monitor memory usage as you go.
Optimize the data filtering logic to make it more efficient. For example, consider partitioning the data differently or using a different column for filtering.
Check the network bandwidth and consider increasing it if it's limiting the streaming speed.
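Confirm that the source table is a Delta table. DLT already writes its own tables in Delta format, so this mainly concerns the source; Delta tables can improve the performance of streaming and querying large datasets.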
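The maxBytesPerTrigger and filtering suggestions above could look roughly like this in PySpark. This is only a sketch under assumptions: the helper names, the 10 GiB limit, and the table and column names passed in are all hypothetical, and in a DLT pipeline the streaming read would normally sit inside a table definition function. Adjust everything to your actual pipeline.

```python
def delta_stream_options(max_bytes_per_trigger=10 * 1024**3,
                         max_files_per_trigger=None):
    """Build rate-limit options for a Delta streaming source.

    The 10 GiB default is a hypothetical starting point, not a recommendation.
    """
    opts = {"maxBytesPerTrigger": str(max_bytes_per_trigger)}
    if max_files_per_trigger is not None:
        opts["maxFilesPerTrigger"] = str(max_files_per_trigger)
    return opts


def read_filtered_stream(spark, table_name, partition_col, partition_value,
                         other_filter):
    """Stream a table, applying the partition-column filter first.

    `other_filter` is a SQL expression string (or Column) for the
    un-partitioned column. All names are supplied by the caller.
    """
    # Imported lazily so the options helper works without PySpark installed.
    from pyspark.sql import functions as F

    return (
        spark.readStream
        .options(**delta_stream_options())
        .table(table_name)
        .where(F.col(partition_col) == partition_value)
        .where(other_filter)
    )
```

For example, `read_filtered_stream(spark, "source_table", "partition_col", "p15", "event_type = 'x'")` would stream only the rows matching both filters, where the table name, column names, and `event_type` filter are placeholders.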