Databricks Community

dataslicer · ‎04-14-2022

I currently run multiple jobs (each on its own job cluster) for my Spark Structured Streaming pipelines, which run 24x7x365 on DBR 9.x/10.x LTS. My SLAs are 24x7x365 availability with 1-minute latency.

I have already applied the following cost-saving measures:

Using job clusters instead of all-purpose compute
Using a 1-minute trigger interval
Using fair-share scheduler pools
Tuning the worker VM SKU type based on utilization

Given the above, are the following additional cost saving configurations proven* to meet the above streaming SLAs and supported** by Databricks?

Spot instances
Auto-scaling
The motivation for exploring these two cost-saving options is that the streaming message volume varies (high and low) at different times of the day.
Any additional cost-saving options not mentioned so far are also welcome.

* Proven == empirical results in some large scale production scenario for some extended period of time to prove its robustness.

** Supported == Stateful streaming and recoveries supported by the current Spark 3.x APIs

For context, I have already applied the current (2022-04-14) best practices written by Databricks.

Any references for and against "Spot instances" and "Auto-scaling" are appreciated.

Thank you!

Anonymous · ‎04-14-2022

Autoscaling doesn't work with Structured Streaming, so that's not really an option. Autoscaling is based on jobs sitting in the job queue for a long time, but that's not the case with streaming. Streaming is more like many frequent small jobs.

Spot instances should save money, but you do risk losing VMs if you are outbid. It can also be useful to purchase a lot of VMs ahead of time from the cloud provider. Usually, if you offer to buy many hours you can get a volume discount.

Enabling Photon should speed things up and reduce your overall VM needs. You'll use more DBUs, but on a smaller cluster, so you should still save money overall through lower cloud costs.

dataslicer · ‎04-14-2022

Thank you for the extra perspectives!

Yes, there is already a volume discount (VM resources purchased ahead of time) negotiated between the company and the cloud provider. Sorry I left that out; I was too focused on the technical options.

I am on the same page with you that both autoscaling and spot instances are not compatible with my Structured Streaming workloads and SLAs. For example, when recovering from a spot instance being outbid, I have to account for X amount of time for replacement nodes to become available and Y amount of time for those nodes to complete the bootstrap sequence (getting imaged and added to my cluster). That variable time (X + Y) would already push my streaming workload out of SLA and cause a source data backlog to build up.
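To make the X + Y argument concrete, here is a minimal Python sketch. All numbers are hypothetical, and the function name `backlog_after_outage` is made up for illustration. It also assumes the worst case, where processing stalls completely until the replacement node has joined the cluster; in practice the remaining workers may keep processing at reduced throughput.

```python
def backlog_after_outage(acquire_s, bootstrap_s, sla_s, arrival_rate_per_s):
    """Estimate the impact of losing spot capacity.

    acquire_s          -- X: seconds until a replacement node is available
    bootstrap_s        -- Y: seconds for the node to be imaged and join the cluster
    sla_s              -- latency SLA in seconds
    arrival_rate_per_s -- incoming messages per second (assumed constant)

    Returns (outage seconds, whether the SLA is breached, messages queued).
    Worst-case assumption: no messages are processed during the outage.
    """
    outage_s = acquire_s + bootstrap_s
    breached = outage_s > sla_s
    backlog = arrival_rate_per_s * outage_s
    return outage_s, breached, backlog


# Hypothetical example: 90 s to acquire a node, 180 s to bootstrap it,
# a 60 s latency SLA, and 500 messages/s arriving at the source.
outage_s, breached, backlog = backlog_after_outage(
    acquire_s=90, bootstrap_s=180, sla_s=60, arrival_rate_per_s=500
)
print(f"outage={outage_s}s breached={breached} backlog={backlog} messages")
```

With any realistic values for X and Y, the outage alone exceeds a 1-minute SLA, and the backlog then has to be worked off on top of the normal incoming volume.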

Based on your answer, the only viable option I have not explored so far is Photon.

Is there a general rule of thumb for how much a cluster can be sized down when moving a workload from non-Photon to Photon? For example: "Photon reduces physical memory requirements by 20% compared to an identical non-Photon workload, but the number of CPU cores should remain the same." (Of course, this is a complete fabrication, but I am looking for that kind of mapping so I know how to optimize my cluster VM sizing for the Photon runtime.) Any references are appreciated. I am happy to make this a new question if it goes beyond the scope of the original one, for example: how do I optimize my compute for a streaming workload when using the Photon runtime? Please let me know. Thanks!

Anonymous · ‎04-15-2022

So much of what Photon can do depends on what you're doing. If you're doing things that are very compatible with the SQL engine and built-in functions, it's great. If you have Python UDFs, then not so much. If you're doing Delta reads/writes, then it's good. I would absolutely test it first, but in general things with Photon should be about 1.8-2x faster, so you would need only roughly 50-60% as many worker nodes.

When using Photon, the SQL DAG will show up in yellow instead of the normal blue, so you can see exactly what it's doing.
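As a rough way to turn the 1.8-2x rule of thumb into a cluster size and a cost comparison, here is a minimal Python sketch. The prices and DBU rates are invented placeholders, not Databricks list prices, and the helper names `workers_needed` and `hourly_cost` are hypothetical. It assumes throughput scales linearly with worker count, which should be verified by testing the actual workload.

```python
import math


def workers_needed(current_workers, speedup):
    """Workers required after a given speedup, assuming linear scaling."""
    return math.ceil(current_workers / speedup)


def hourly_cost(workers, vm_price, dbu_per_worker, dbu_price):
    """Hourly cluster cost: cloud VM cost plus DBU cost for each worker."""
    return workers * (vm_price + dbu_per_worker * dbu_price)


current = 10
print(workers_needed(current, 1.8))  # 6 workers at a 1.8x speedup
print(workers_needed(current, 2.0))  # 5 workers at a 2.0x speedup

# Placeholder prices: replace with the rates from your own contract.
VM_PRICE = 0.50     # cloud cost per worker-hour (assumed)
DBU_PRICE = 0.15    # price per DBU (assumed)
DBU_STANDARD = 1.0  # DBUs per worker-hour without Photon (assumed)
DBU_PHOTON = 2.0    # DBUs per worker-hour with Photon (assumed)

before = hourly_cost(current, VM_PRICE, DBU_STANDARD, DBU_PRICE)
after = hourly_cost(workers_needed(current, 1.8), VM_PRICE, DBU_PHOTON, DBU_PRICE)
print(f"non-Photon: {before:.2f}/h, Photon: {after:.2f}/h")
```

Whether Photon comes out cheaper depends on the ratio of VM price to DBU price and on the speedup actually measured, so it is worth plugging in real numbers before resizing.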

dataslicer · ‎04-15-2022

Thank you so much for characterizing the Photon improvements in different scenarios!

I will definitely explore this new path.

Alexey · ‎04-25-2022

Hey,

Can you provide more information on how you set up the fair-scheduler pools? I am trying to follow the instructions and provide the XML file with multiple pools (the default is FIFO only), but I can't get it to work. 😕

Thanks in advance.

dataslicer · ‎05-11-2022

You might want to start a new question thread so there is enough space to capture the context and the issues you are experiencing. That way you get better visibility and support from the community. For reference, this Spark documentation should have everything you need to get started.
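As a starting point, here is a minimal Python sketch that generates a fair-scheduler allocation file in the format described in the Spark job scheduling documentation (`allocations` / `pool` / `schedulingMode` / `weight` / `minShare`). The pool names, weights, and the helper `build_fair_scheduler_xml` are hypothetical, and this is not necessarily the setup used in this thread.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools):
    """Build the XML text for a Spark fair-scheduler allocation file.

    pools maps a pool name to a dict with optional keys
    'schedulingMode' (FAIR or FIFO), 'weight', and 'minShare'.
    """
    root = ET.Element("allocations")
    for name, settings in pools.items():
        pool = ET.SubElement(root, "pool", {"name": name})
        ET.SubElement(pool, "schedulingMode").text = settings.get("schedulingMode", "FAIR")
        ET.SubElement(pool, "weight").text = str(settings.get("weight", 1))
        ET.SubElement(pool, "minShare").text = str(settings.get("minShare", 0))
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools, one per streaming query.
xml_text = build_fair_scheduler_xml({
    "stream_a": {"weight": 2, "minShare": 2},
    "stream_b": {"weight": 1},
})
print(xml_text)

# To use the file (the path is up to you and is an assumption here):
#   1. Write xml_text to a location the driver can read.
#   2. Set these Spark configs on the cluster:
#        spark.scheduler.mode FAIR
#        spark.scheduler.allocation.file <path to the XML file>
#   3. Before starting each streaming query, pick its pool:
#        spark.sparkContext.setLocalProperty("spark.scheduler.pool", "stream_a")
```

The pool is selected per thread through the `spark.scheduler.pool` local property, so it has to be set before each query is started; queries started without it run in the default pool.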