When running structured streaming jobs in production, what are the general best practices to reduce cost?

User16752245312
New Contributor III

Consider a basic Structured Streaming use case: aggregate the data, perform some basic data-cleaning transformations, and merge the result into a historical aggregate dataset.
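For concreteness, a rough sketch of that pattern in PySpark (the table, column, and checkpoint names are made up, and `spark` is the notebook-provided session):

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

def merge_into_history(batch_df, batch_id):
    # Upsert this micro-batch of fresh aggregates into the historical aggregate table.
    (DeltaTable.forName(spark, "gold.daily_device_aggregates").alias("t")
        .merge(batch_df.alias("s"),
               "t.device_id = s.device_id AND t.event_date = s.event_date")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.table("bronze.events")                 # streaming source
    .where(F.col("event_ts").isNotNull())                # basic cleaning
    .withColumn("event_date", F.to_date("event_ts"))
    .groupBy("device_id", "event_date")                  # aggregate
    .agg(F.count("*").alias("event_count"))
    .writeStream
    .outputMode("update")
    .foreachBatch(merge_into_history)                    # merge into the historical dataset
    .option("checkpointLocation", "/chk/daily_device_aggregates")
    .start())
```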

5 REPLIES

User16826994223
Honored Contributor III

What I can think of is:

1. Set the trigger processing time to some interval rather than running continuously. The API hits against checkpoint storage increase cost, not in DBUs but on the cloud vendor bill.

2. If you have multiple streams, multiplex them onto one cluster rather than running a different cluster for each stream. (A sketch of both points follows below.)
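A rough sketch of both suggestions, with made-up table names and a 10-minute interval standing in for whatever your SLA allows:

```python
# Two streams multiplexed onto the same job/cluster, each firing on a fixed
# processing-time trigger instead of back-to-back micro-batches, which cuts the
# number of checkpoint/listing calls made against cloud storage.
orders_q = (spark.readStream.table("bronze.orders")
    .writeStream
    .trigger(processingTime="10 minutes")      # assumed interval, tune to your SLA
    .option("checkpointLocation", "/chk/orders")
    .toTable("silver.orders"))

events_q = (spark.readStream.table("bronze.events")
    .writeStream
    .trigger(processingTime="10 minutes")
    .option("checkpointLocation", "/chk/events")
    .toTable("silver.events"))

spark.streams.awaitAnyTermination()            # one long-running job hosts both queries
```
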
brickster_2018
Esteemed Contributor
  • There is always a trade-off between cost and batch execution time. It's possible to launch a small cluster, limit the data per batch, and run the workload successfully; however, there is a chance the streaming workload will build up a backlog. So choosing the right cluster size is important, and the prime factor in sizing should be the SLA for data availability rather than cost. But if cost takes precedence, launching a small cluster and limiting the data per batch will help (see the snippet after this list).
  • As you are doing an aggregate operation, it can involve state management as well. If so, choosing the right state store can also help avoid unnecessary cost from disk expansion (also shown below).
  • As @Kunal Gaurav mentioned, you can plan to run multiple streams on an interactive cluster. However, note that streaming applications can be long-running, and because automated job clusters are billed at a lower rate than interactive ones, running the workloads as jobs could be cheaper.
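A minimal sketch of the rate-limiting and state-store points (the file limit and paths are illustrative assumptions, and the provider class shown is the RocksDB state store on Databricks Runtime):

```python
# Rate-limit each micro-batch so a smaller cluster can keep up
# (the value is an illustrative assumption; tune it against your SLA).
events = (spark.readStream
    .option("maxFilesPerTrigger", 50)
    .table("bronze.events"))

# Keep aggregation state in RocksDB on local disk instead of JVM memory,
# reducing the pressure to oversize the cluster for stateful queries.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)
```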

Soma
Valued Contributor

This will help a lot. Please make sure you follow these guidelines before moving to production:

https://docs.databricks.com/spark/latest/structured-streaming/production.html

lawrence009
Contributor

I second the recommendations: Auto Loader with a trigger, and batch-style processing instead of continuous streaming where the use case permits (a sketch follows the list below). In addition:

  • test with a small batch first
  • favor fewer, larger workers over many smaller workers
  • adjust your job cluster over time, using the Spark UI and cluster metrics to see where steps can be optimized and compute resources reduced
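A possible sketch of the Auto Loader-with-a-trigger idea (source path, schema location, and table name are assumptions):

```python
# Auto Loader run batch-style: drain whatever has arrived, then stop, so the
# job can be scheduled on a job cluster instead of paying for an always-on stream.
(spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/autoloader_events/schema")
    .load("s3://example-bucket/raw/events/")   # hypothetical landing path
    .writeStream
    .trigger(availableNow=True)                # process the backlog, then exit
    .option("checkpointLocation", "/chk/autoloader_events")
    .toTable("bronze.events"))
```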

Meghala
Valued Contributor II

Yes, that's the correct one.
