When running structured streaming jobs in production, what are the general best practices to reduce cost?

User16752245312
New Contributor III

Consider a basic Structured Streaming use case: aggregating the data, performing some basic data-cleaning transformations, and merging the result into a historical aggregate dataset.
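For context, a minimal PySpark sketch of that pipeline; the paths, table layout, and column names here are hypothetical placeholders, and the merge runs per micro-batch via foreachBatch:

```python
from pyspark.sql import functions as F
from delta.tables import DeltaTable

# Hypothetical raw source (path and columns are assumptions).
events = spark.readStream.format("delta").load("/data/raw/events")

# Basic cleaning plus an aggregation per key and day.
agg = (
    events
    .filter(F.col("value").isNotNull())
    .withColumn("event_date", F.to_date("timestamp"))
    .groupBy("key", "event_date")
    .agg(F.sum("value").alias("total"))
)

def upsert_batch(batch_df, batch_id):
    # Merge each micro-batch of updated aggregates into the historical table.
    target = DeltaTable.forPath(spark, "/data/agg/history")
    (target.alias("t")
        .merge(batch_df.alias("s"),
               "t.key = s.key AND t.event_date = s.event_date")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

query = (
    agg.writeStream
    .foreachBatch(upsert_batch)
    .outputMode("update")                       # emit updated aggregate rows each batch
    .option("checkpointLocation", "/chk/agg_history")
    .start()
)
```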

5 REPLIES

User16826994223
Honored Contributor III

Here is what I can think of:

1. Set the trigger to a processing-time interval rather than running continuously. The API hits on checkpoint storage increase cost; not DBUs, but the cloud vendor's storage-request charges.

2. If you have multiple streams, multiplex them onto one cluster rather than running a different cluster for each stream.

Both points are sketched below.
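A minimal sketch of both points, assuming `df1` and `df2` are already-defined streaming DataFrames; the 5-minute interval, paths, and Delta sink are placeholder choices:

```python
# Point 1: a fixed trigger interval instead of the default as-fast-as-possible
# micro-batches; fewer batches means fewer checkpoint writes to cloud storage.
q1 = (
    df1.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/stream1")
    .trigger(processingTime="5 minutes")   # interval is an assumption; tune per SLA
    .start("/data/out/stream1")
)

# Point 2: multiplex streams by starting multiple queries in the same Spark
# application on one cluster, each with its own checkpoint location.
q2 = (
    df2.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/stream2")
    .trigger(processingTime="5 minutes")
    .start("/data/out/stream2")
)

spark.streams.awaitAnyTermination()  # keep the application alive for all queries
```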

User16869510359
Esteemed Contributor
  • There is always a trade-off between cost and batch execution time. It is possible to launch a small cluster, limit the data per batch, and run the job successfully, but the streaming workload may then develop a backlog. Choosing the right cluster size is important, and the prime factor in sizing should be the SLA for data availability rather than cost; if cost takes precedence, launching a small cluster will help.
  • Since you are doing an aggregate operation, it can involve state management as well. If so, choosing the right state store can also help avoid unnecessary cost from disk expansion (see the sketch after this list).
  • As @Kunal Gaurav mentioned, you can plan to run multiple streams on an interactive cluster. Note, however, that streaming applications can be long-running, and because on-demand clusters are cheaper, running the workloads on an on-demand cluster could reduce cost.
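On the state-store and per-batch sizing points, a hedged sketch: the provider class below is the RocksDB state store built into Spark 3.2+, and the file limit and path are placeholder values:

```python
# Use the RocksDB state store (Spark 3.2+) so large aggregation state spills
# to local disk instead of living entirely on the JVM heap.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

# Cap the data pulled into each micro-batch so a small cluster can keep up
# (Delta source option; Kafka sources use maxOffsetsPerTrigger instead).
limited = (
    spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)   # value is an assumption; tune per SLA
    .load("/data/raw/events")
)
```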

Soma
Valued Contributor

This will help a lot. Please make sure you follow these recommendations before moving to production:

https://docs.databricks.com/spark/latest/structured-streaming/production.html

lawrence009
Contributor

I second the recommendations: Auto Loader with a trigger, and batch processing instead of continuous streaming where the use case permits (a sketch follows after this list). In addition:

  • test with a small batch first
  • favor fewer, larger workers over more, smaller workers
  • adjust your job cluster over time, using the Spark UI and cluster metrics to see where steps can be optimized and compute resources reduced
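For the "Auto Loader with a trigger" pattern, a minimal sketch, assuming a Databricks runtime with Auto Loader and Spark 3.3+ for availableNow; the paths and file format are placeholders:

```python
# Auto Loader reading incrementally, run as a scheduled batch job:
# Trigger.AvailableNow processes everything that has arrived, then stops,
# so no cluster sits idle between runs.
df = (
    spark.readStream
    .format("cloudFiles")                            # Databricks Auto Loader source
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/chk/schema")
    .load("/data/landing")
)

(
    df.writeStream
    .format("delta")
    .option("checkpointLocation", "/chk/autoloader")
    .trigger(availableNow=True)                      # drain the backlog, then exit
    .start("/data/bronze/events")
)
```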

Meghala
Valued Contributor II

Yes, correct.
