
Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my Structured Streaming job?

User16857281869
New Contributor II

It's usually one or more of the following reasons:

1) If you are streaming into a table, you should be using the .trigger() option to specify the frequency of checkpointing. Otherwise, the job will call the storage API roughly every 10 ms to log the transaction data. That will explode the cost even when no data is coming in that fast.
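For example, an explicit processing-time trigger keeps the checkpoint and commit traffic bounded. A minimal PySpark sketch, assuming a Delta source and target; the paths and the one-minute interval are placeholders, not values from this post:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (
    spark.readStream
        .format("delta")
        .load("/mnt/source/events")                               # placeholder source path
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/checkpoints/events")  # placeholder checkpoint path
        .trigger(processingTime="1 minute")                       # commit/checkpoint at most once per minute
        .start("/mnt/target/events")                              # placeholder target path
)
```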

2) If you are aggregating data, a higher number of shuffle partitions results in more checkpointing, because the state is checkpointed per partition. Set spark.sql.shuffle.partitions, ideally to the number of workers.
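A minimal sketch of lowering the shuffle-partition count for the streaming session (spark is the active SparkSession, predefined in Databricks notebooks); the value 8 is only an illustration, size it to your cluster:

```python
# Fewer shuffle partitions => fewer per-partition state/checkpoint files per micro-batch.
spark.conf.set("spark.sql.shuffle.partitions", "8")
```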

3) When writing to a Delta table, make sure delta.autoOptimize.optimizeWrite = true to reduce the number of files written (for non-low-latency use cases). While writing to Delta, we "list" the transaction log and "put" one file per shuffle partition per affected table partition folder, plus one more put for the transaction log. E.g. if the target table is partitioned by date and we get INSERTs for today and some UPDATEs for the last 9 days, a total of 10 table partitions are affected; with spark.sql.shuffle.partitions = 200, that is at least 2,000 API calls per micro-batch/trigger.
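A minimal sketch of turning the property on for an existing Delta table; the table name is a placeholder:

```python
# Optimized writes coalesce the output into fewer, larger files per micro-batch.
spark.sql("""
    ALTER TABLE my_target_table
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")
```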

4) Try not to use the display() function. Checkpoint files are created by it, but are not deleted.

You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder: checkpoint files remain in that folder.
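From a notebook you can list that folder with dbutils (available only inside Databricks); a minimal sketch:

```python
# file:/ points dbutils at the driver's local disk rather than DBFS.
for f in dbutils.fs.ls("file:/local_disk0/tmp/"):
    print(f.path, f.size)
```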

Accepted Solution

Hubert-Dudek
Esteemed Contributor III
  • please mount cheaper storage (LRS) as a custom mount and put the checkpoints there (see the sketch after this list),
  • please clear the checkpoint data regularly,
  • if you are using foreach/foreachBatch in the stream, it will save every DataFrame to DBFS,
  • please remember not to use display() in production,
  • if you don't store much data there (so the per-GB price matters little), it is better to use "premium" class storage: it costs more per GB but less for all other operations
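A minimal sketch of the first bullet, assuming an Azure Blob container on a cheaper LRS storage account; the account, container, mount point, and secret names are placeholders, not values from this answer:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mount the cheap LRS container once per workspace (dbutils is available in Databricks notebooks).
dbutils.fs.mount(
    source="wasbs://checkpoints@cheaplrsaccount.blob.core.windows.net",
    mount_point="/mnt/cheap-checkpoints",
    extra_configs={
        "fs.azure.account.key.cheaplrsaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="storage", key="cheap-account-key")
    },
)

# Point the stream's checkpoint at the cheaper mount.
(
    spark.readStream.format("delta").load("/mnt/source/events")           # placeholder source
        .writeStream
        .format("delta")
        .option("checkpointLocation", "/mnt/cheap-checkpoints/my_stream")
        .trigger(processingTime="1 minute")
        .start("/mnt/target/events")                                      # placeholder target
)
```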
