Dear support,
We have a set of DLT pipelines that stream incoming data at a very low rate, and we need to find the root cause of this delay.
Below are details on the setup of the DLT pipelines and some metrics for the source table:
-The source table has 63.000.077.072 records
-The source table has 2 partition columns which map directly to column values
-The source table has 4 partition columns which are calculated from column values
-The streaming query filters the source table on one partition column and one non-partition column at the same time
-Metrics of the targeted partitions:
partition,records
p1,2082775
p2,932645
p3,2808
p4,5
p5,2
p6,30990942
p7,80
p8,143623
p9,1735803700
p10,4819113815
p11,4749727822
p12,12491237547
p13,17198069143
p14,18333204664
p15,3638767501
-The partition of interest is p15 and holds 3.638.767.501 records
-After filtering on the partition column and the non-partition column, 76.929.237 records of interest need to be streamed
-The following options are used while streaming:
option("maxBytesPerTrigger", 1024 * 1024 * <MB_PER_TRIGGER_PROPERTY>)
option("ignoreChanges", "true")
option("startingTimestamp", <CUT_OFF_PROPERTY>)
MB_PER_TRIGGER_PROPERTY=10
CUT_OFF_PROPERTY=a given date
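For reference, the figures and options above can be put together in a short Python sketch. The helper name `build_stream_options` and the placeholder date are illustrative assumptions and not part of the actual pipeline code. The sketch only shows how many bytes each trigger is limited to and how selective the filters are.

```python
# Record counts per partition, copied from the metrics above.
PARTITION_RECORDS = {
    "p1": 2_082_775, "p2": 932_645, "p3": 2_808, "p4": 5, "p5": 2,
    "p6": 30_990_942, "p7": 80, "p8": 143_623, "p9": 1_735_803_700,
    "p10": 4_819_113_815, "p11": 4_749_727_822, "p12": 12_491_237_547,
    "p13": 17_198_069_143, "p14": 18_333_204_664, "p15": 3_638_767_501,
}
RECORDS_OF_INTEREST = 76_929_237  # rows left after both filters
MB_PER_TRIGGER_PROPERTY = 10


def build_stream_options(mb_per_trigger, cut_off):
    """Return the three streaming options listed above as a dict (hypothetical helper)."""
    return {
        "maxBytesPerTrigger": 1024 * 1024 * mb_per_trigger,
        "ignoreChanges": "true",
        "startingTimestamp": cut_off,
    }


# "2024-01-01" is a placeholder; the real cut-off date is not stated in this post.
options = build_stream_options(MB_PER_TRIGGER_PROPERTY, "2024-01-01")
print(options["maxBytesPerTrigger"])  # 10485760 bytes per trigger

total = sum(PARTITION_RECORDS.values())
assert total == 63_000_077_072  # matches the stated size of the source table

p15 = PARTITION_RECORDS["p15"]
print(f"p15 share of the source table: {p15 / total:.2%}")
print(f"filtered rows as a share of p15: {RECORDS_OF_INTEREST / p15:.2%}")
```

A dict like this could be handed to a Delta streaming read through `.options(**options)`. In other words, each trigger is limited to roughly 10 MB of source data, while the rows of interest are only a small fraction of the p15 partition.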
-The DLT pipelines have the following specs in terms of processing power:
"node_type_id": "Standard_E8ds_v4",
"driver_node_type_id": "Standard_E8ds_v4",
"autoscale": {
"min_workers": 1,
"max_workers": 1,
"mode": "LEGACY"
}
"photon": false
The problem observed is the following:
Data is written to the destination tables very slowly. For instance, only 2 million records have reached the destination table after 50+ hours of streaming.
Note: There are 4 DLT pipelines streaming concurrently from the same source table and appending to different destination tables.
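To put the observed rate in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that the 2 million records are part of the 76.929.237 records of interest for a single pipeline, that "2 million" and "50 hours" are exact, and that the rate stays constant. Since the real duration is 50+ hours, the result is an optimistic estimate.

```python
RECORDS_DELIVERED = 2_000_000      # "2 million records" (approximate)
HOURS_ELAPSED = 50                 # "50+ hours", taken as a lower bound
RECORDS_OF_INTEREST = 76_929_237   # rows that need to be streamed in total

# Observed throughput, assuming a constant rate over the whole period.
records_per_hour = RECORDS_DELIVERED / HOURS_ELAPSED
records_per_second = records_per_hour / 3600

# Time still needed for the remaining rows at the same rate.
remaining = RECORDS_OF_INTEREST - RECORDS_DELIVERED
hours_remaining = remaining / records_per_hour
days_remaining = hours_remaining / 24

print(f"{records_per_hour:,.0f} records/hour")
print(f"{records_per_second:.1f} records/second")
print(f"{days_remaining:.0f} more days at this rate")
```

Under these assumptions the stream delivers about 40,000 records per hour, so catching up on the remaining records would take on the order of months.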
Best regards,
EDDatabricks