Hi, I'm running a structured streaming job on a pipeline with a medallion architecture.
In the silver layer we read from the bronze layer with structured streaming and write each stream out via a foreachBatch function that applies some transformations and then does a MERGE INTO. There are about 60 Delta tables processed in this one workflow, so we use thread pools and multithreading to run the per-table streams concurrently, on a 16-core driver with 10 workers of 4 cores each. A rough sketch of the pattern is below.
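For concreteness, here's a stripped-down sketch of what I mean (PySpark). The table names, join key, checkpoint path, and trigger are placeholders rather than our actual code:

```python
# Minimal sketch of the bronze -> silver pattern described above, assuming
# Databricks/PySpark with the Delta Lake Python API. Names and paths are placeholders.
from concurrent.futures import ThreadPoolExecutor
from delta.tables import DeltaTable
from pyspark.sql import DataFrame, SparkSession

spark = SparkSession.builder.getOrCreate()

def upsert_to_silver(target_table: str):
    """Build the foreachBatch callback that merges one micro-batch into a silver table."""
    def merge_batch(batch_df: DataFrame, batch_id: int):
        # ... per-batch transformations would go here ...
        silver = DeltaTable.forName(spark, target_table)
        (silver.alias("t")
               .merge(batch_df.alias("s"), "t.id = s.id")  # hypothetical join key
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())
    return merge_batch

def run_one_table(name: str):
    """Start the bronze -> silver stream for a single table and wait for it to finish."""
    (spark.readStream
          .table(f"bronze.{name}")                                       # placeholder source
          .writeStream
          .foreachBatch(upsert_to_silver(f"silver.{name}"))
          .option("checkpointLocation", f"/checkpoints/silver/{name}")   # placeholder path
          .trigger(availableNow=True)                                    # our trigger may differ
          .start()
          .awaitTermination())

table_names = ["table_a", "table_b"]  # placeholder for the ~60 silver tables

# Fan the per-table streams out over a thread pool so they run concurrently on the driver.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(run_one_table, table_names))
```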
Currently the silver layer processing takes 8 minutes, which is longer than we expect given that many of the tables have zero updates or inserts. Looking at the Spark UI, I noticed a large stretch of downtime between the bulk of our jobs and a few trailing jobs. From the screen grab below, it looks like the driver is doing some work during that gap, and it adds roughly 4 minutes to the run. I dug into what those last jobs are processing: it's a MERGE of a small number of records (<1000) into a ~16 GB Delta table, and that seems to be what's holding everything up. We've already run OPTIMIZE and VACUUM on all of our silver layer Delta tables.
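For reference, this is roughly the maintenance we run per table (table name is a placeholder), plus a DESCRIBE HISTORY call, which is one way to see the operationMetrics on that slow merge and confirm it only touches a handful of rows:

```python
# Roughly the maintenance we already run on every silver table; "silver.table_a" is a
# placeholder. DESCRIBE HISTORY shows operationMetrics (rows/files touched) for recent
# operations, including the trailing MERGE.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("OPTIMIZE silver.table_a")
spark.sql("VACUUM silver.table_a")
spark.sql("DESCRIBE HISTORY silver.table_a LIMIT 5").show(truncate=False)
```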
Hoping someone has suggestions for optimization, or can give some context on what exactly the driver is doing during that downtime.
Thanks