3 weeks ago - last edited 3 weeks ago
Hi Databricks experts,
We're experiencing unexpectedly high costs from Regional Standard Class A Operations in GCS while running a Databricks pipeline. The costs seem related to frequent metadata queries, possibly tied to Delta table operations.
Last month, GCS recorded 324,916,780 operations in this category. Could you please advise on:
1. Configurations to reduce metadata queries in Databricks.
2. Best practices for managing GCS-related costs in such pipelines.
I can share more details if needed.
Your assistance would be greatly appreciated.
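For a sense of scale, the operation count in this post can be turned into a rough monthly figure. This is only a sketch: the price of 0.05 USD per 10,000 Class A operations is an assumption for regional Standard storage, not a number from this thread, so it should be checked against the actual bill or the current GCS pricing page.

```python
# Rough cost estimate for the Class A operation count quoted in this post.
CLASS_A_OPERATIONS = 324_916_780  # figure reported for last month

# ASSUMPTION: price per 10,000 Class A operations for regional Standard
# storage. Replace with the rate shown on your own invoice.
ASSUMED_USD_PER_10K = 0.05


def class_a_cost(operations: int, usd_per_10k: float) -> float:
    """Return the estimated cost in USD for a number of Class A operations."""
    return operations / 10_000 * usd_per_10k


if __name__ == "__main__":
    cost = class_a_cost(CLASS_A_OPERATIONS, ASSUMED_USD_PER_10K)
    # Under the assumed price this comes to roughly 1,624.58 USD.
    print(f"Estimated Class A cost: {cost:,.2f} USD")
```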
3 weeks ago
There are some approaches you can test:
- Set the spark.databricks.io.cache.enabled configuration to true.
- Run the OPTIMIZE command to compact small files into larger ones, which can reduce the number of metadata operations.
3 weeks ago
@minhhung0507 thanks for your question!
> The costs seem related to frequent metadata queries, possibly tied to Delta table operations.
Before optimizing, we should first confirm that the high GCS costs truly come from metadata operations triggered by Delta table activity.
Could you please check which operations (GET, LIST, etc.) are driving the costs? Correlate their timestamps with Databricks jobs and sources, then identify tables or operations that might be repeatedly scanning small files or listing directories. Also confirm that no other process (such as separate data pipelines or watchers) is hitting GCS.
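One way to do this correlation is sketched here, assuming the GCS request logs have already been exported into simple records. The record fields (`timestamp`, `method`), the job names, and the run windows are all hypothetical placeholders, not an actual log schema.

```python
from collections import Counter
from datetime import datetime


def count_by_method(records):
    """Count operations per method label (e.g. GET, LIST)."""
    return Counter(r["method"] for r in records)


def candidate_jobs(records, job_runs):
    """Count operations that fall inside each job run window.

    job_runs is a list of (job_name, start, end) tuples. An operation that
    falls inside several overlapping windows is counted for each of them,
    so the result lists candidates rather than proven owners.
    """
    counts = Counter()
    for r in records:
        matches = [name for name, start, end in job_runs
                   if start <= r["timestamp"] <= end]
        for name in matches or ["<no matching job>"]:
            counts[name] += 1
    return counts


# Hypothetical sample data, only to show the expected shapes.
SAMPLE_RECORDS = [
    {"timestamp": datetime(2025, 1, 1, 0, 0, 5), "method": "LIST"},
    {"timestamp": datetime(2025, 1, 1, 0, 0, 20), "method": "GET"},
    {"timestamp": datetime(2025, 1, 1, 0, 1, 10), "method": "LIST"},
    {"timestamp": datetime(2025, 1, 1, 0, 5, 0), "method": "LIST"},
]
SAMPLE_RUNS = [
    ("stream_orders", datetime(2025, 1, 1, 0, 0, 0), datetime(2025, 1, 1, 0, 0, 30)),
    ("batch_report", datetime(2025, 1, 1, 0, 1, 0), datetime(2025, 1, 1, 0, 2, 0)),
]

if __name__ == "__main__":
    print(count_by_method(SAMPLE_RECORDS))
    print(candidate_jobs(SAMPLE_RECORDS, SAMPLE_RUNS))
```

Operations that land in the "<no matching job>" bucket are the ones worth checking against processes outside Databricks.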
Only after collecting this evidence can you assert that frequent Delta metadata queries are causing the high costs. Then you can apply the usual strategies (OPTIMIZE, caching, file compaction, etc.) as a next step.
It is not always possible to avoid metadata operations, though you can of course reduce them. I believe you first need to make sure these costs are coming from metadata operations, and then review your Spark jobs to figure out which ones could be optimized.
3 weeks ago
Thank you for providing us with several solutions regarding our high Google Cloud Storage costs. We have confirmed that the high costs are indeed coming from metadata operations triggered by Delta table activity, with the GET and LIST operations significantly contributing to the expenses.
We have also been applying all three strategies: caching, OPTIMIZE, and Z-Ordering. However, we are not seeing a significant reduction in costs.
Could you please advise if there are any additional methods we could apply? Additionally, are there any specific considerations we should be aware of? We understand that this problem is difficult to solve, so your guidance means a lot to us.
Thank you for your assistance!
Best regards,
Hung.
3 weeks ago
@minhhung0507 this is more or less aligned with my previous suggestion: we need a full understanding of where these metadata operations come from. Can you provide a summary of the process you followed to confirm that they are triggered by Delta table activity, and what sort of activity it is?
Generally speaking, the current suggestions would be to:
1. Pinpoint the Biggest Offenders: identify what is driving the GET and LIST calls (e.g., is it a streaming job, a frequent batch job, or multiple concurrent jobs?).
2. Reduce Small Files and Frequent Listing: use auto compaction and optimized writes (spark.databricks.delta.autoCompact.enabled, spark.databricks.delta.optimizeWrite.enabled) to keep file counts low.
3. Examine Workflow Frequency.
4. Check Delta Retention & History: a large _delta_log directory can cause frequent metadata checks. Make sure you're vacuuming older versions if you don't need long retention for time travel.
5. Tune for Fewer Metadata Operations: set spark.databricks.io.cache.enabled = true on all relevant clusters and check if caching is actually being utilized (e.g., repeated queries on the same data).
So, have you identified which tables or queries are the main sources of repeated LIST calls? Are there any pipelines or watchers outside Databricks also hitting these same GCS paths?
By first pinpointing and correlating where these calls originate (jobs, tables, intervals) and then tuning how often and how they list files, you should see a larger reduction in GCS Class A operations.
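As a minimal sketch, the settings named in this reply could be applied from a notebook roughly as follows. The function names and the table name are hypothetical, `spark` is assumed to be the SparkSession that Databricks provides, and whether session-level settings are the right scope (as opposed to cluster or table configuration) depends on your setup.

```python
# Session-level settings mentioned in this thread.
DELTA_SETTINGS = {
    "spark.databricks.delta.autoCompact.enabled": "true",
    "spark.databricks.delta.optimizeWrite.enabled": "true",
    "spark.databricks.io.cache.enabled": "true",
}


def apply_settings(spark, settings=DELTA_SETTINGS):
    """Set each configuration key on the session and return what was read back."""
    for key, value in settings.items():
        spark.conf.set(key, value)
    return {key: spark.conf.get(key) for key in settings}


def maintenance_statements(table_name):
    """Build the compaction and cleanup statements for one Delta table."""
    # VACUUM without a RETAIN clause uses the table's default retention.
    return [f"OPTIMIZE {table_name}", f"VACUUM {table_name}"]


def run_maintenance(spark, table_name):
    """Run OPTIMIZE and VACUUM for one table (hypothetical helper)."""
    for statement in maintenance_statements(table_name):
        spark.sql(statement)


# Usage in a Databricks notebook, with a placeholder table name:
#   apply_settings(spark)
#   run_maintenance(spark, "my_catalog.my_schema.my_table")
```

Because OPTIMIZE and VACUUM list files themselves, it is probably better to run them on a schedule than after every write.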
3 weeks ago
Hi @VZLA ,
Thank you for your detailed suggestions and guidance.
Here are some updates and points regarding our current setup:
Are you thinking the problem is 300 jobs being triggered continuously?
3 weeks ago
@minhhung0507 it's hard to say without more direct insight, but generally speaking, many streaming jobs with very frequent trigger intervals will likely contribute. 300 jobs triggered continuously will also contribute, depending on their use case. Are they all the same? Are these Spark jobs, or "jobs" in some other context? Can they be consolidated?
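To put the 300 jobs in perspective, here is a back-of-the-envelope calculation. It assumes a 30-day month and that the reported 324,916,780 operations are spread evenly across the 300 jobs, which is almost certainly not true in practice but gives a baseline to compare individual jobs against.

```python
SECONDS_PER_DAY = 86_400


def per_job_rate(total_operations, jobs, days):
    """Return (operations per job per day, operations per job per second)."""
    per_day = total_operations / jobs / days
    return per_day, per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # ASSUMPTION: 30-day month, operations spread evenly across jobs.
    per_day, per_second = per_job_rate(324_916_780, jobs=300, days=30)
    # Roughly 36,102 operations per job per day, about 0.42 per second.
    print(f"{per_day:,.0f} ops/job/day, {per_second:.2f} ops/job/second")
```

A job that issues far more than this average is a good first candidate for consolidation or a longer trigger interval.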
Consider generating a histogram of metadata calls per job, sorted in descending order of frequency. Streaming jobs, especially those with high trigger rates, often contribute significantly to the number of metadata operations and drive up GCS costs. Additionally, analyze the distribution of storage files by creating a histogram of files per directory and of file sizes. Pay close attention to the _delta_log, checkpoint, and version files, as excessive file counts in these areas can escalate metadata operations. A 2-day retention policy combined with frequent VACUUM and OPTIMIZE tasks may further amplify LIST and GET calls.
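Both histograms can be sketched in plain Python, assuming you already have a list of (job, method) pairs from the request logs and a listing of object paths from the bucket. The input shapes, job names, and bucket paths below are hypothetical.

```python
from collections import Counter
import posixpath


def calls_per_job(calls):
    """Count metadata calls per job from (job_name, method) pairs."""
    return Counter(job for job, _method in calls)


def files_per_directory(object_paths):
    """Count objects per parent directory, e.g. to spot a crowded _delta_log."""
    return Counter(posixpath.dirname(path) for path in object_paths)


def render_histogram(counts, width=40):
    """Return text rows sorted by descending count, with a bar scaled to width."""
    if not counts:
        return []
    largest = max(counts.values())
    rows = []
    for name, n in counts.most_common():
        bar = "#" * max(1, round(width * n / largest))
        rows.append(f"{name:<40} {n:>6} {bar}")
    return rows


if __name__ == "__main__":
    # Hypothetical inputs, only to show the expected shapes.
    calls = [("job_a", "LIST"), ("job_b", "GET"), ("job_a", "GET"), ("job_a", "LIST")]
    paths = [
        "gs://example-bucket/tbl/_delta_log/0.json",
        "gs://example-bucket/tbl/_delta_log/1.json",
        "gs://example-bucket/tbl/part-0.parquet",
    ]
    print("\n".join(render_histogram(calls_per_job(calls))))
    print("\n".join(render_histogram(files_per_directory(paths))))
```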
Key recommendations include:
3 weeks ago
Dear @VZLA ,
Thank you so much for your detailed insights and recommendations!
We truly appreciate your suggestion about visualizing and analyzing metadata operations—it’s an excellent idea that we can definitely apply to identify and prioritize optimizations. Additionally, avoiding full table scans whenever possible is another valuable approach we’ll consider.
We will review our entire pipeline thoroughly, apply your recommendations, and continue monitoring the system to ensure improvements.
Thanks again for your support!