Why do I see a cost explosion in my blob storage account (DBFS storage, blob storage, ...) for my Structured Streaming job?

User16857281869
New Contributor II

It's usually one or more of the following reasons:

1) If you are streaming into a table, you should use the .trigger() option to specify the trigger interval, and therefore how often checkpoint and transaction-log data is written. Otherwise the job will call the storage API as often as every 10 ms to log the transaction data, which will explode the cost even when no data is coming in that fast.
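A minimal sketch, assuming a Delta sink; the DataFrame, table name, and checkpoint path below are placeholders, not taken from the original post:

# Hypothetical streaming write: an explicit 1-minute processing-time trigger means
# the checkpoint and Delta transaction log are updated once per minute instead of
# as fast as micro-batches can be scheduled.
(events_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/events")
    .trigger(processingTime="1 minute")
    .toTable("events_bronze"))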

2) If you are aggregating the data, a higher number of shuffle partitions results in more checkpointing, because the state gets checkpointed for each partition. Set spark.sql.shuffle.partitions ideally to the number of workers.
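For example (the value 8 is only illustrative; choose it to match your cluster):

# Lower the shuffle-partition count from the default 200 so that far fewer
# state-store checkpoint files are written per micro-batch.
spark.conf.set("spark.sql.shuffle.partitions", "8")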

3) When writing to a Delta table, make sure delta.autoOptimize.optimizeWrite = true to reduce the number of files written (for non-low-latency use cases). While writing to Delta, we “list” the transaction log and “put” 1 file per shuffle partition per affected table partition folder, plus 1 more put for the transaction log entry. E.g. if the target table is partitioned by date and we get INSERTs for today and some UPDATEs for the last 9 days, a total of 10 table partitions are affected; with spark.sql.shuffle.partitions = 200 that is at least 2,000 API calls per micro-batch/trigger.
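A sketch of enabling the property on the target table (the table name is a placeholder; the property can also be set when the table is created):

# Enable optimized writes on the (hypothetical) target Delta table so each
# micro-batch writes fewer, larger files per table partition.
spark.sql("""
    ALTER TABLE events_bronze
    SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true)
""")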

4) Try not to use the display() function on a stream. It creates checkpoint files that never get deleted.

You can verify the problem by navigating to the root directory and looking in the /local_disk0/tmp/ folder: leftover checkpoint files remain there.
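A quick way to check from a notebook, assuming the path above and access to the driver's local disk:

import os

# List what is left in the driver's local temp folder; a pile of leftover
# checkpoint directories here usually points at display() being used on a stream.
entries = os.listdir("/local_disk0/tmp")
print(f"{len(entries)} entries in /local_disk0/tmp")
for name in sorted(entries)[:20]:
    print(name)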

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III
  • Mount cheaper storage (e.g. an LRS storage account) as a custom mount point and keep your checkpoints there (see the sketch after this list).
  • Clear checkpoint and temporary data regularly.
  • If you are using foreach/foreachBatch in a stream, it will save every DataFrame to DBFS.
  • Remember not to use display() in production.
  • If your cost on that storage is driven by operations rather than by GB stored, it can be better to use "premium" class storage: it costs more per GB but less for all other operations.
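A minimal sketch of the first point, assuming Azure Blob Storage with the account key kept in a secret scope; the account, container, mount point, secret scope, and key names are all placeholders:

# Mount a cheaper (e.g. LRS) storage container under a custom mount point
# and keep streaming checkpoints there instead of in the default DBFS root.
dbutils.fs.mount(
    source="wasbs://checkpoints@cheaplrsaccount.blob.core.windows.net",
    mount_point="/mnt/checkpoints",
    extra_configs={
        "fs.azure.account.key.cheaplrsaccount.blob.core.windows.net":
            dbutils.secrets.get(scope="storage-secrets", key="cheap-account-key")
    },
)

# Then point each stream at the mount, e.g.
# .option("checkpointLocation", "/mnt/checkpoints/<stream-name>")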
