When running structured streaming jobs in production, what are the general best practices to reduce cost?
06-23-2021 01:07 PM
Consider a basic Structured Streaming use case: aggregate the data, perform some basic data-cleaning transformations, and merge the result into a historical aggregate dataset.
- Labels: CICD, JOBS, Structured streaming, Use Case
06-23-2021 06:20 PM
What I can think of:
1. Set a processing-time trigger with some interval rather than running continuously. The API hits against checkpoint storage increase cost — not DBUs, but the cloud vendor's storage-request charges.
2. If you have multiple streams, multiplex them onto one cluster rather than running a separate cluster for each stream.
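To put a rough number on point 1, here is a back-of-the-envelope helper (plain Python; the calls-per-batch figure is an illustrative assumption) estimating checkpoint-storage API requests per day at a given trigger interval. Each micro-batch writes offset and commit files to the checkpoint location, so widening the interval directly cuts storage-request charges:

```python
# Rough estimate of checkpoint-storage API calls per day for one streaming
# query, to show why a wider processing-time trigger lowers cloud-storage
# cost. calls_per_batch is an illustrative assumption (offset file, commit
# file, plus some list/read traffic per micro-batch).

SECONDS_PER_DAY = 24 * 60 * 60

def checkpoint_calls_per_day(trigger_interval_s: float, calls_per_batch: int = 4) -> int:
    """Approximate storage API requests per day at a given trigger interval."""
    batches = SECONDS_PER_DAY / trigger_interval_s
    return int(batches * calls_per_batch)

# Near-continuous (1 s micro-batches) vs. a 1-minute processing-time trigger:
continuous = checkpoint_calls_per_day(1)    # 345,600 requests/day
per_minute = checkpoint_calls_per_day(60)   #   5,760 requests/day
```

The Spark-side change is just `.trigger(processingTime="1 minute")` on the `writeStream` (the interval itself is a placeholder; pick one that fits your latency SLA).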
06-24-2021 02:06 AM
- There is always a trade-off between cost and batch execution time. It is possible to launch a small cluster, limit the data per batch, and run the job successfully; however, the streaming workload may then develop a backlog. Choosing the right cluster size is important, and the prime factor in sizing should be the SLA for data availability rather than cost. If cost takes precedence, though, launching a small cluster will help.
- As you are doing an aggregate operation, the query can involve state management as well. If so, choosing the best state store can also help avoid unnecessary cost from disk expansion.
- As @Kunal Gaurav mentioned, you can plan to run multiple streams on an interactive cluster. However, note that streaming applications can be long-running, and because job (automated) clusters are billed at a lower DBU rate than interactive clusters, running the workload on its own job cluster could turn out cheaper.
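The first two points above map to two concrete knobs, sketched here as plain configuration values (the numbers and the Delta-source option name are illustrative assumptions, not tuned recommendations): capping the input per micro-batch so a small cluster can keep up, and switching stateful aggregations to the RocksDB state store so state lives on local disk rather than the executor JVM heap:

```python
# Illustrative settings only - values are placeholder assumptions.

# 1. Cap how much data each micro-batch reads, so a deliberately small
#    cluster can still finish every batch. For a Delta source this is
#    maxFilesPerTrigger; Auto Loader uses cloudFiles.maxFilesPerTrigger.
read_options = {
    "maxFilesPerTrigger": "100",
}

# 2. Use the RocksDB state store provider for stateful aggregations, so
#    streaming state is kept on local disk instead of the JVM heap
#    (provider class name as documented for Databricks Runtime).
state_store_conf = {
    "spark.sql.streaming.stateStore.providerClass":
        "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
}
```

These would be applied via `.options(**read_options)` on the `readStream` and `spark.conf.set(...)` (or the cluster's Spark config) before the query starts.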
12-10-2021 06:41 AM
This will help a lot; please make sure you follow these practices before moving to production:
https://docs.databricks.com/spark/latest/structured-streaming/production.html
12-21-2022 07:10 PM
I second the recommendations: Auto Loader with a trigger, and batch processing instead of continuous streaming where the use case permits. In addition:
- Test with a small batch first.
- Favor fewer, larger workers over more, smaller ones.
- Adjust your job cluster over time: look at the Spark UI and cluster metrics to see where steps can be optimized and compute resources reduced.
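The "Auto Loader with a trigger" pattern above can be sketched as follows; the paths, format, and table name are placeholder assumptions. `trigger(availableNow=True)` processes everything that has arrived since the last run and then stops, so the job cluster only runs while there is work (on older Spark versions, `trigger(once=True)` plays a similar role):

```python
def start_autoloader_batch(spark,
                           source_path="/landing/events",   # placeholder path
                           checkpoint="/chk/events",        # placeholder path
                           target_table="events_bronze"):   # placeholder name
    """Incremental 'batch' ingestion: Auto Loader plus availableNow trigger.

    Each scheduled run picks up only the files that arrived since the last
    run (tracked in the checkpoint), writes them to the target table, then
    the query stops and the job cluster can terminate - you pay for compute
    only while data is actually being processed.
    """
    return (spark.readStream
            .format("cloudFiles")                           # Auto Loader source
            .option("cloudFiles.format", "json")            # assumed input format
            .option("cloudFiles.schemaLocation", checkpoint)
            .load(source_path)
            .writeStream
            .option("checkpointLocation", checkpoint)
            .trigger(availableNow=True)                     # drain backlog, then stop
            .toTable(target_table))
```

Scheduling this as a job (say, hourly) keeps ingestion incremental without paying for an always-on streaming cluster.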
12-26-2022 04:28 AM
Yes, that's correct.