When running structured streaming jobs in production, what are the general best practices to reduce cost?

User16752245312
New Contributor III

Consider a basic Structured Streaming use case: aggregating the data, performing some basic data-cleaning transformations, and merging the result into a historical aggregate dataset.
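For context, a minimal PySpark sketch of that pipeline; the paths, table layout, and column names here are hypothetical placeholders, and the merge runs per micro-batch via foreachBatch:

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Hypothetical raw source (path and columns are assumptions).
events = spark.readStream.format("delta").load("/data/raw/events")

# Basic cleaning plus an aggregation per key and day.
agg = (
    events
    .filter(F.col("value").isNotNull())
    .withColumn("event_date", F.to_date("timestamp"))
    .groupBy("key", "event_date")
    .agg(F.sum("value").alias("total"))
)

def upsert_batch(batch_df, batch_id):
    # Merge each micro-batch of updated aggregates into the historical table.
    target = DeltaTable.forPath(spark, "/data/agg/history")
    (target.alias("t")
        .merge(batch_df.alias("s"),
               "t.key = s.key AND t.event_date = s.event_date")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = (
    agg.writeStream
    .foreachBatch(upsert_batch)
    .outputMode("update")                       # emit updated aggregate rows each batch
    .option("checkpointLocation", "/chk/agg_history")
    .start()
)
```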

5 REPLIES

User16826994223
Honored Contributor III

Here is what I can think of:

1. Set the trigger to a processing-time interval rather than running continuously. The API hits on checkpoint storage increase cost; not DBUs, but the cloud vendor's storage-request charges.

2. If you have multiple streams, multiplex them onto one cluster rather than running a different cluster for each stream.

Both points are sketched below.
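A minimal sketch of both points, assuming `df1` and `df2` are already-defined streaming DataFrames; the 5-minute interval, paths, and Delta sink are placeholder choices:

```python
# Point 1: a fixed trigger interval instead of the default as-fast-as-possible
# micro-batches; fewer batches means fewer checkpoint writes to cloud storage.
q1 = (
    df1.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/stream1")
    .trigger(processingTime="5 minutes")   # interval is an assumption; tune per SLA
    .start("/data/out/stream1")
)

# Point 2: multiplex streams by starting multiple queries in the same Spark
# application on one cluster, each with its own checkpoint location.
q2 = (
    df2.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/stream2")
    .trigger(processingTime="5 minutes")
    .start("/data/out/stream2")
)

spark.streams.awaitAnyTermination()  # keep the application alive for all queries
```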

User16869510359
Esteemed Contributor
  • There is always a trade-off between cost and batch execution time. It is possible to launch a small cluster, limit the data per batch, and run the job successfully, but the streaming workload may then develop a backlog. Choosing the right cluster size is important, and the prime factor in sizing should be the SLA for data availability rather than cost; if cost takes precedence, launching a small cluster will help.
  • Since you are doing an aggregate operation, it can involve state management as well. If so, choosing the right state store can also help avoid unnecessary cost from disk expansion (see the sketch after this list).
  • As @Kunal Gaurav mentioned, you can plan to run multiple streams on an interactive cluster. Note, however, that streaming applications can be long-running, and because on-demand clusters are cheaper, running the workloads on an on-demand cluster could reduce cost.
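On the state-store and per-batch sizing points, a hedged sketch: the provider class below is the RocksDB state store built into Spark 3.2+, and the file limit and path are placeholder values:

```python
# Use the RocksDB state store (Spark 3.2+) so large aggregation state spills
# to local disk instead of living entirely on the JVM heap.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

# Cap the data pulled into each micro-batch so a small cluster can keep up
# (Delta source option; Kafka sources use maxOffsetsPerTrigger instead).
limited = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)   # value is an assumption; tune per SLA
    .load("/data/raw/events")
)
```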

Soma
Valued Contributor

This will help a lot. Please make sure you follow these recommendations before moving to production:

https://docs.databricks.com/spark/latest/structured-streaming/production.html

lawrence009
Contributor

I second the recommendations: Auto Loader with a trigger, and batch processing instead of continuous streaming where the use case permits (a sketch follows after this list). In addition:

  • test with a small batch first
  • favor fewer, larger workers over more, smaller workers
  • adjust your job cluster over time, using the Spark UI and cluster metrics to see where steps can be optimized and compute resources reduced
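For the "Auto Loader with a trigger" pattern, a minimal sketch, assuming a Databricks runtime with Auto Loader and Spark 3.3+ for availableNow; the paths and file format are placeholders:

```python
# Auto Loader reading incrementally, run as a scheduled batch job:
# Trigger.AvailableNow processes everything that has arrived, then stops,
# so no cluster sits idle between runs.
df = (
    spark.readStream
    .format("cloudFiles")                            # Databricks Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/schema")
    .load("/data/landing")
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/autoloader")
    .trigger(availableNow=True)                      # drain the backlog, then exit
    .start("/data/bronze/events")
)
```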

Meghala
Valued Contributor II

Yes, correct.
