Help optimizing large empty gaps where no executors are running jobs in Spark UI. Structured streaming writing.

djfliu
New Contributor III

Hi, I'm running a structured streaming job on a pipeline with a medallion architecture.

In my silver layer, we read from the bronze layer with Structured Streaming and write the stream to the silver layer with a foreachBatch function that applies some transformations and a MERGE INTO operation. There are about 60 Delta tables processed in this one workflow, so we used pools and multithreading to run these jobs concurrently on a 16-core driver with 10 workers of 4 cores each.
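
For context, each table roughly follows this foreachBatch + merge pattern (a minimal sketch; the table names, merge key, and trigger below are placeholders, not our actual config):

```python
from delta.tables import DeltaTable

def upsert_to_silver(microbatch_df, batch_id):
    # Hypothetical transformation step; the real job applies its own logic here.
    transformed = microbatch_df.dropDuplicates(["id"])

    silver = DeltaTable.forName(spark, "silver.some_table")   # table name is illustrative
    (silver.alias("t")
           .merge(transformed.alias("s"), "t.id = s.id")      # merge key is an assumption
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .table("bronze.some_table")                              # illustrative bronze source
      .writeStream
      .foreachBatch(upsert_to_silver)
      .option("checkpointLocation", "/checkpoints/silver/some_table")
      .trigger(availableNow=True)                              # or a processing-time trigger
      .start())
```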

Currently the silver layer processing takes about 8 minutes, which is longer than we expect since many of the tables have zero updates or inserts. Looking at the Spark UI, I noticed a large stretch of downtime between the bulk of our jobs and a few trailing jobs. From the screen grab, it looks like the driver is doing some work here, which adds an extra ~4 minutes to the run. I dug into what the last jobs are processing: it is a merge of a small number of records (<1,000) into a Delta table (~16 GB) that seems to be holding up the job. We've run OPTIMIZE and VACUUM on all our silver layer Delta tables.
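
For reference, the maintenance we've already run is just the standard commands on each table (table name is a placeholder):

```python
# Compact small files and clean up old snapshots on a silver table.
spark.sql("OPTIMIZE silver.some_table")
spark.sql("VACUUM silver.some_table")  # default 7-day retention
```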

Hoping someone has some suggestions for optimization or can give some context on what the driver is doing exactly during the downtime.

Thanks

3 REPLIES

Hubert-Dudek
Esteemed Contributor III
  • Is it a single 16 GB file? If so, it may help to split it into sensible partitions (partitions on disk), so that not everything needs to be read or written each time.
  • If you have to write that 16 GB at once, make sure the DataFrame is repartitioned (Spark partitions) to the number of cores (so 40) before the write, so every core processes a chunk of data, and then check that those 40 files are similar in size to avoid data skew (a rough sketch follows this list).
  • Check for data spill in the Spark UI; spill means shuffle partitions are being written from RAM to disk and back. The 25th, 50th, and 75th percentiles should be similar. Increase shuffle partitions if data frequently has to be written to disk.
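
A rough sketch of the repartition-before-write and shuffle-partition ideas (paths, table names, and values are illustrative, not a definitive setup):

```python
# Repartition to the total executor core count (10 workers x 4 cores = 40) before
# writing, so each core writes one similarly sized chunk; verify in the Spark UI
# and in the output files that the chunks are balanced.
(df.repartition(40)
   .write
   .format("delta")
   .mode("overwrite")
   .save("/mnt/silver/some_table"))

# If the Spark UI shows frequent spill to disk, raising shuffle partitions can help.
spark.conf.set("spark.sql.shuffle.partitions", 200)  # value is illustrative; tune per workload
```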

djfliu
New Contributor III

Ah sorry Hubert, after re-reading my post I can see where I caused some confusion. I've updated my description to better reflect my issue.

Basically, the job isn't processing 16 GB of data; it is just merging data (using the Delta merge operation) into a Delta table that is ~16 GB in size, partitioned across ~200 files.

The data being merged is very small (~500-1,000 records).

Anonymous
Not applicable

Hi @Danny Liu

Hope all is well! Just wanted to check in to see if you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
