Help optimizing large empty gaps where no executors are running jobs in Spark UI. Structured streaming writing.

djfliu
New Contributor III

Hi, I'm running a structured streaming job on a pipeline with a medallion architecture.

In my silver layer, we read from the bronze layer with Structured Streaming and write the stream to the silver layer with a foreachBatch function that applies some transformations and a MERGE INTO operation. There are about 60 Delta tables processed in this one workflow, so we used pools and multithreading to run these jobs concurrently on a 16-core driver with 10 workers of 4 cores each.
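
For context, each table roughly follows this foreachBatch + merge pattern (a minimal sketch; the table names, merge key, and trigger below are placeholders, not our actual config):

```python
from delta.tables import DeltaTable

def upsert_to_silver(microbatch_df, batch_id):
    # Hypothetical transformation step; the real job applies its own logic here.
    transformed = microbatch_df.dropDuplicates(["id"])

    silver = DeltaTable.forName(spark, "silver.some_table")   # table name is illustrative
    (silver.alias("t")
           .merge(transformed.alias("s"), "t.id = s.id")      # merge key is an assumption
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .table("bronze.some_table")                              # illustrative bronze source
      .writeStream
      .foreachBatch(upsert_to_silver)
      .option("checkpointLocation", "/checkpoints/silver/some_table")
      .trigger(availableNow=True)                              # or a processing-time trigger
      .start())
```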

Currently the silver layer processing takes about 8 minutes, which is longer than we expect since many of the tables have zero updates or inserts. Looking at the Spark UI, I noticed a large stretch of downtime between the bulk of our jobs and a few trailing jobs. From the screen grab, it looks like the driver is doing some work here, which adds an extra ~4 minutes to the run. I dug into what the last jobs are processing: it is a merge of a small number of records (<1,000) into a Delta table (~16 GB) that seems to be holding up the job. We've run OPTIMIZE and VACUUM on all our silver layer Delta tables.
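
For reference, the maintenance we've already run is just the standard commands on each table (table name is a placeholder):

```python
# Compact small files and clean up old snapshots on a silver table.
spark.sql("OPTIMIZE silver.some_table")
spark.sql("VACUUM silver.some_table")  # default 7-day retention
```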

Hoping someone has some suggestions for optimization or can give some context on what the driver is doing exactly during the downtime.

Thanks

3 REPLIES

Hubert-Dudek
Esteemed Contributor III
  • Is it a single 16 GB file? If so, it may help to split it into sensible partitions (partitions on disk), so that not everything needs to be read or written each time.
  • If you have to write that 16 GB at once, make sure the DataFrame is repartitioned (Spark partitions) to the number of cores (so 40) before the write, so every core processes a chunk of data, and then check that those 40 files are similar in size to avoid data skew (a rough sketch follows this list).
  • Check for data spill in the Spark UI; spill means shuffle partitions are being written from RAM to disk and back. The 25th, 50th, and 75th percentiles should be similar. Increase shuffle partitions if data frequently has to be written to disk.
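
A rough sketch of the repartition-before-write and shuffle-partition ideas (paths, table names, and values are illustrative, not a definitive setup):

```python
# Repartition to the total executor core count (10 workers x 4 cores = 40) before
# writing, so each core writes one similarly sized chunk; verify in the Spark UI
# and in the output files that the chunks are balanced.
(df.repartition(40)
   .write
   .format("delta")
   .mode("overwrite")
   .save("/mnt/silver/some_table"))

# If the Spark UI shows frequent spill to disk, raising shuffle partitions can help.
spark.conf.set("spark.sql.shuffle.partitions", 200)  # value is illustrative; tune per workload
```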

djfliu
New Contributor III

Ah sorry Hubert, after re-reading my post I can see where I caused some confusion. I've updated my description to better reflect my issue.

Basically, the job isn't processing 16 GB of data; it is just merging data (using the Delta merge operation) into a Delta table that is ~16 GB in size, partitioned across ~200 files.

The data being merged is very small (~500-1,000 records).

Anonymous
Not applicable

Hi @Danny Liu

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
