Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my Structured Streaming job?

User16857281869
New Contributor II

It's usually one or more of the following reasons:

1) If you are streaming into a table, you should use the .trigger() option to specify the trigger interval, and therefore how often checkpoint and transaction-log data is written. Otherwise the job will call the storage API as often as every 10 ms to log the transaction data, which will explode the cost even when no data is coming in that fast.
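A minimal sketch, assuming a Delta sink; the DataFrame, table name, and checkpoint path below are placeholders, not taken from the original post:

# Hypothetical streaming write: an explicit 1-minute processing-time trigger means
# the checkpoint and Delta transaction log are updated once per minute instead of
# as fast as micro-batches can be scheduled.
(events_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(processingTime="1 minute")
    .toTable("events_bronze"))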

2) If you are aggregating the data, a higher number of shuffle partitions results in more checkpointing, because the state gets checkpointed for each partition. Set spark.sql.shuffle.partitions ideally to the number of workers.
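For example (the value 8 is only illustrative; choose it to match your cluster):

# Lower the shuffle-partition count from the default 200 so that far fewer
# state-store checkpoint files are written per micro-batch.
spark.conf.set("spark.sql.shuffle.partitions", "8")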

3) When writing to a Delta table, make sure delta.autoOptimize.optimizeWrite = true to reduce the number of files written (for non-low-latency use cases). While writing to Delta, we “list” the transaction log and “put” 1 file per shuffle partition per affected table partition folder, plus 1 more put for the transaction log entry. E.g. if the target table is partitioned by date and we get INSERTs for today and some UPDATEs for the last 9 days, a total of 10 table partitions are affected; with spark.sql.shuffle.partitions = 200 that is at least 2,000 API calls per micro-batch/trigger.
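A sketch of enabling the property on the target table (the table name is a placeholder; the property can also be set when the table is created):

# Enable optimized writes on the (hypothetical) target Delta table so each
# micro-batch writes fewer, larger files per table partition.
spark.sql("""
    ALTER TABLE events_bronze
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")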

4) Try not to use the display() function on a stream. It creates checkpoint files that never get deleted.

You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder: leftover checkpoint files remain there.
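A quick way to check from a notebook, assuming the path above and access to the driver's local disk:

import os

# List what is left in the driver's local temp folder; a pile of leftover
# checkpoint directories here usually points at display() being used on a stream.
entries = os.listdir("/local_disk0/tmp")
print(f"{len(entries)} entries in /local_disk0/tmp")
for name in sorted(entries)[:20]:
    print(name)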

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III
  • Mount cheaper storage (e.g. an LRS storage account) as a custom mount point and keep your checkpoints there (see the sketch after this list).
  • Clear checkpoint and temporary data regularly.
  • If you are using foreach/foreachBatch in a stream, it will save every DataFrame to DBFS.
  • Remember not to use display() in production.
  • If your cost on that storage is driven by operations rather than by GB stored, it can be better to use "premium" class storage: it costs more per GB but less for all other operations.
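A minimal sketch of the first point, assuming Azure Blob Storage with the account key kept in a secret scope; the account, container, mount point, secret scope, and key names are all placeholders:

# Mount a cheaper (e.g. LRS) storage container under a custom mount point
# and keep streaming checkpoints there instead of in the default DBFS root.
dbutils.fs.mount(
    source="wasbs://checkpoints@cheaplrsaccount.blob.core.windows.net",
    mount_point="/mnt/checkpoints",
    extra_configs={
        "fs.azure.account.key.cheaplrsaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="storage-secrets", key="cheap-account-key")
    },
)

# Then point each stream at the mount, e.g.
# .option("checkpointLocation", "/mnt/checkpoints/<stream-name>")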
