Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

When running structured streaming jobs in production, what are the general best practices to reduce cost?

User16752245312
New Contributor III

Consider a basic structured streaming use case: aggregating the data, performing some basic data-cleaning transformations, and merging the result into a historical aggregate dataset.

5 REPLIES

User16826994223
Honored Contributor III

Here is what I can think of:

1. Set a processing-time trigger with a fixed interval rather than running continuously. The API calls to checkpoint storage increase cost: not DBUs, but charges from the cloud vendor.

2. If you have multiple streams, multiplex them into one rather than running a different cluster for each stream.
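A minimal sketch of both points, assuming PySpark and a Delta sink. The helper names (`batches_per_day`, `start_with_interval`, `start_many`) and the one-minute default are hypothetical, chosen only for illustration. Each micro-batch touches the checkpoint location, so the number of batches per day is a rough proxy for how often storage gets hit.

```python
def batches_per_day(trigger_seconds: float) -> int:
    """Upper bound on micro-batches started per day for a fixed trigger interval."""
    return int(24 * 60 * 60 // trigger_seconds)


def start_with_interval(df, checkpoint_path: str, table_name: str,
                        interval: str = "1 minute"):
    """Start a streaming write that fires at most once per `interval`.

    `df` is a streaming DataFrame; the path and table name are placeholders.
    """
    return (
        df.writeStream
        .format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime=interval)
        .toTable(table_name)
    )


def start_many(streams, interval: str = "1 minute"):
    """Start several queries from the same session, i.e. on one cluster.

    `streams` is an iterable of (df, checkpoint_path, table_name) tuples.
    Each query needs its own checkpoint location.
    """
    return [start_with_interval(df, cp, tbl, interval) for df, cp, tbl in streams]


# Rough comparison: a 1-second cadence versus a 1-minute trigger.
print(batches_per_day(1), batches_per_day(60))
```

After `start_many(...)`, calling `spark.streams.awaitAnyTermination()` keeps the driver alive while all queries share the cluster.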

brickster_2018
Esteemed Contributor
  • There is always a trade-off between cost and batch execution time. It's possible to launch a small cluster, limit the data per batch, and run the job successfully. However, the streaming workload may then develop a backlog. So choosing the right cluster size is important, and the prime factor in sizing should be the SLA for data availability rather than cost. If cost takes precedence, though, launching a small cluster will help.
  • Because you are doing an aggregation, state management can be involved as well. If so, choosing the right state store can also help avoid unnecessary disk-expansion costs.
  • As @Kunal Gaurav mentioned, you can plan to run multiple streams on an interactive cluster. Note, however, that streaming applications can be long-running, and because on-demand clusters are cheaper, running the workloads on an on-demand cluster could cost less.
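To make the backlog trade-off concrete, here is a rough sketch with made-up rates. It also shows one way to switch the state store, assuming the RocksDB provider that ships with open-source Spark 3.2+ is what fits your workload; the reply above does not name a specific store, and Databricks documents its own provider class for its runtime, so check the docs for your version. The function names are hypothetical.

```python
# Provider class shipped with open-source Spark 3.2+ (assumption: this is the
# store you want; the default keeps state in executor memory).
ROCKSDB_PROVIDER = (
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider"
)


def backlog_after(hours: float, arrival_rate: float, processing_rate: float):
    """Records left unprocessed after `hours`, with both rates in records/second.

    If the cluster keeps up (processing_rate >= arrival_rate), the backlog is 0.
    """
    growth_per_second = max(0.0, arrival_rate - processing_rate)
    return growth_per_second * hours * 3600


def use_rocksdb_state_store(spark, provider: str = ROCKSDB_PROVIDER) -> None:
    """Set the state store provider; must be done before the query starts."""
    spark.conf.set("spark.sql.streaming.stateStore.providerClass", provider)


# Illustrative numbers only: a small cluster handling 1,000 records/s
# against 1,500 records/s of input falls behind by 500 records every second.
print(backlog_after(hours=2, arrival_rate=1500, processing_rate=1000))
```

If the backlog after a typical peak exceeds what your SLA tolerates, that points to a larger cluster; if it stays at zero with headroom, a smaller one may do.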

Soma
Valued Contributor

This will help a lot. Please make sure to follow these recommendations before moving to production:

https://docs.databricks.com/spark/latest/structured-streaming/production.html

lawrence009
Contributor

I second the recommendations: Auto Loader with a trigger, and batch processing instead of continuous streaming where the use case permits. In addition:

  • test with a small batch first
  • favor fewer larger workers over more smaller workers
  • adjust your job cluster over time by looking at the Spark UI and cluster metrics to see where steps can be optimized and computing resources reduced
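As a sketch of the Auto Loader plus trigger idea, assuming a Databricks runtime where the `cloudFiles` source and `trigger(availableNow=True)` are available: the job processes whatever has arrived, then stops, so the cluster can shut down between scheduled runs. The helper names, paths, and the file cap are placeholders, not values from this thread.

```python
def autoloader_options(file_format: str, schema_location: str,
                       max_files_per_batch: int) -> dict:
    """Build Auto Loader reader options; the cap keeps each batch small for testing."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
        "cloudFiles.maxFilesPerTrigger": str(max_files_per_batch),
    }


def run_incremental_batch(spark, source_path: str, checkpoint_path: str,
                          table_name: str, options: dict):
    """Ingest all files that arrived since the last run, then stop the query."""
    stream = (
        spark.readStream
        .format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
    query = (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)  # drain the available data, then terminate
        .toTable(table_name)
    )
    query.awaitTermination()
    return query


# Start with a small cap while testing, then raise it once the job is stable.
opts = autoloader_options("json", "/tmp/example/_schema", max_files_per_batch=10)
print(opts)
```

Scheduling this as a job on a job cluster gives batch-style cost with streaming-style incremental bookkeeping, since the checkpoint tracks which files were already processed.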

Meghala
Valued Contributor II

Yes, that's correct.
