Spark Structured Streaming: Data write is too slow into adls.

UmaMahesh1 · 11-29-2022

I'm fairly new to Spark Structured Streaming, so please ask if I've left out anything relevant.

I have a notebook that consumes events from a Kafka topic and writes the records to ADLS. The topic is JSON-serialized, so I write the value column to ADLS as a JSON string as-is, without flattening. (I plan to do the flattening later in a separate notebook, where I'll provide the schema.)

For the writer (for each batch), I set the trigger to availableNow = true, and maxOffsetsPerTrigger is set to 5000. While consuming the data, I also add a current_timestamp column to record when each event was consumed. I run the notebook once a day, as I don't want to run up the resource bill.
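To make the setup concrete, here is a minimal PySpark sketch of the kind of pipeline described above. It is not the actual notebook code: the broker address, topic name, ADLS paths, the `consumed_at` column name, the `startingOffsets` value, and the JSON output format are all assumptions chosen for illustration.

```python
# Sketch only: every value marked "hypothetical" is a placeholder, not taken from the real job.
KAFKA_OPTIONS = {
    "kafka.bootstrap.servers": "broker-1:9092",  # hypothetical broker
    "subscribe": "events-topic",                 # hypothetical topic
    "startingOffsets": "earliest",               # assumption; only used on the very first run
    "maxOffsetsPerTrigger": "5000",              # upper bound on offsets read per micro-batch
}

OUTPUT_PATH = "abfss://container@account.dfs.core.windows.net/raw/events"       # hypothetical
CHECKPOINT_PATH = "abfss://container@account.dfs.core.windows.net/_chk/events"  # hypothetical


def start_daily_load(spark):
    """Read the topic, keep `value` as a JSON string, stamp it, and write it to ADLS."""
    # Imported here so the constants above can be inspected without Spark installed.
    from pyspark.sql.functions import col, current_timestamp

    raw = spark.readStream.format("kafka").options(**KAFKA_OPTIONS).load()

    events = raw.select(
        col("value").cast("string").alias("value"),   # JSON payload, not flattened
        current_timestamp().alias("consumed_at"),     # when the event was consumed
    )

    return (
        events.writeStream
        .format("json")                               # assumption: output format is not stated
        .option("path", OUTPUT_PATH)
        .option("checkpointLocation", CHECKPOINT_PATH)
        .trigger(availableNow=True)                   # process the backlog, then stop
        .start()
    )
```

In a notebook this would be called as `start_daily_load(spark).awaitTermination()`, so the run ends once the backlog available at start time has been processed.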

Since availableNow = true processes all the data available since the last checkpoint in micro-batches, I expected chunks of close to 5000 records to be written to ADLS. What I actually see is a seemingly random number of records, anywhere from 1 to 1000, being written every 5-15 minutes, which I can tell from the current_timestamp column I added while reading the topic.

E.g. 20-11-2022 10:30:00:000 - 10 records

20-11-2022 10:35:00:000 - 1 record

20-11-2022 10:45:00:000 - 250 records .....
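The per-timestamp counts above can be reproduced with a small batch query over the written files. This is a sketch under assumptions: the output is readable as JSON, the timestamp column is named `consumed_at`, and `path` points at the output folder (all hypothetical names). As far as I understand, `current_timestamp()` is evaluated once per micro-batch, so rows written by the same micro-batch share one value and the counts approximate the micro-batch sizes.

```python
def records_per_micro_batch(spark, path, ts_col="consumed_at"):
    """Count written records per consumption timestamp, oldest first.

    `path` and `ts_col` are hypothetical; adjust them to the real output
    location and column name.
    """
    written = spark.read.json(path)  # assumption: files were written as JSON
    return written.groupBy(ts_col).count().orderBy(ts_col)
```

Calling `records_per_micro_batch(spark, OUTPUT_PATH).show(truncate=False)` would list one row per distinct timestamp with its record count.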

Because of this odd behavior, if the topic producer produces around 2000 events in a day, it takes around 45 minutes to 1 hour to consume the data and load it into ADLS.
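For comparison, here is the arithmetic behind my expectation as a small sketch. It assumes an idealized case in which the whole backlog is already in the topic when the run starts and maxOffsetsPerTrigger is the only thing limiting micro-batch size; the function name is made up for illustration.

```python
import math


def expected_micro_batches(backlog_events: int, max_offsets_per_trigger: int) -> int:
    """Idealized number of micro-batches needed to drain a backlog."""
    if backlog_events <= 0:
        return 0
    # Each micro-batch reads at most `max_offsets_per_trigger` offsets.
    return math.ceil(backlog_events / max_offsets_per_trigger)


print(expected_micro_batches(2000, 5000))   # 1
print(expected_micro_batches(12000, 5000))  # 3
```

Under these assumptions a day's worth of about 2000 events would fit in a single micro-batch, which is why dozens of small writes spread over an hour look wrong to me.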

Can anyone explain why this is happening? Running a pipeline job for 1 hour to load 1000 records definitely seems like overkill.

P.S. This has nothing to do with the cluster, as it is a very high-performance cluster in a production environment.

Uma Mahesh D
