Databricks Community

minhhung0507 · ‎01-01-2025

Hi Databricks experts,

We're experiencing unexpectedly high costs from Regional Standard Class A Operations in GCS while running a Databricks pipeline. The costs seem related to frequent metadata queries, possibly tied to Delta table operations.

In last month, our GCS has 324 916 780 operations for this category. Could you please advise on:

1. Configurations to reduce metadata queries in Databricks.

2. Best practices for managing GCS-related costs in such pipelines. I can share more details if needed.

Your assistance would be greatly appreciated.

Regards,
Hung Nguyen

VZLA · ‎01-03-2025

@minhhung0507 its hard to say without having more direct insight, but generally speaking many streaming jobs with very frequent intervals will likely contribute; 300 jobs triggered continously will also contribute depending on the use case of these jobs, are all these all the same? Are these Spark Jobs, or "Jobs" in some other context? can these be consolidated?

Consider generating a histogram of metadata calls per job and prioritizing the sort in descending order of frequency, per job. Streaming jobs, especially with high trigger rates, often contribute significantly to the number of metadata operations, driving up GCS costs. Additionally, analyze the distribution of storage files by creating a histogram of files per directory and file sizes. Pay close attention to the _delta_log, checkpoint, and version files, as excessive file counts in these areas can escalate metadata operations. A 2-day retention policy combined with frequent vacuum and optimize tasks may further amplify LIST and GET calls.

Key recommendations include:

Consolidate jobs: Assess if 300 jobs per minute can be combined into fewer, larger jobs.
Identify and optimize queries that contribute most to metadata operations.
Avoid full table scans where possible.
For streaming jobs, use larger, less frequent batches instead of high-frequency triggers, as near-real-time streams are major contributors to metadata calls.

View solution in original post

Walter_C · ‎01-02-2025

There are some approaches you can test:

Delta Caching: Enable Delta caching to reduce the number of metadata queries. This can be done by setting the spark.databricks.io.cache.enabled configuration to true.
Optimize Command: Use the OPTIMIZE command to compact small files into larger ones, which can reduce the number of metadata operations.
Z-Ordering: Apply Z-Ordering to your Delta tables to optimize the layout of data, which can help reduce the number of metadata queries during read operations.

VZLA · ‎01-02-2025

@minhhung0507 thanks for your question!

The costs seem related to frequent metadata queries, possibly tied to Delta table operations.

Before optimizing, we should first confirm that the high GCS costs truly come from metadata operations triggered by Delta table activity.

Could you please look for which operations (GET, LIST, etc.) are driving costs? Correlate timestamps with Databricks jobs and sources. Then identify tables or operations that might be repeatedly scanning small files or listing directories. Confirm no other process (like separate data pipelines or watchers) is hitting GCS.

Only after collecting this evidence can you assert that frequent Delta metadata queries are causing the high costs. Then you can apply the usual strategies (OPTIMIZE, caching, file compaction, etc.) as a next step.

It is not always possible to avoid metadata operations, you may of course reduce them, but I believe its required to make sure these costs are coming from metadata operations first, and then review your spark jobs to figure out which could be optimized.

minhhung0507 · ‎01-02-2025

Dear @VZLA , @Walter_C ,

Thank you for providing us with several solutions regarding our high Google Cloud Storage costs. We have confirmed that the high costs are indeed coming from metadata operations triggered by Delta table activity, with the GET and LIST operations significantly contributing to the expenses.

We also have been implementing all three strategies: Caching, Optimize, and Z-Ordering. However, we are not seeing a significant reduction in costs.

Could you please advise if there are any additional methods we could apply? Additionally, are there any specific considerations we should be aware of? We understand that this problem is difficult to solve, so your guidance means a lot to us.

Thank you for your assistance!

Best regards,

Hung.

Regards,
Hung Nguyen

VZLA · ‎01-03-2025

@minhhung0507 this is more or less aligned with my previous suggestion, where it is required to have full understanding of the nature of the sources of these metadata operations. Can you provide a summary of the process you followed to confirm these are coming from metadata operations triggered by Delta table activity and what sort of?

Generally speaking current suggestions would be to:

Pinpoint the Biggest Offenders
- Verify which specific workloads/jobs cause the highest GCS GET and LIST calls (e.g., is it a streaming job, a frequent batch job, or multiple concurrent jobs?).
- Check run logs or GCS logs for timestamps and correlate them with Databricks jobs to isolate the main contributors.
Reduce Small Files and Frequent Listing
- Even after OPTIMIZE, new small files may still appear if data arrives in small batches. Consider Auto Optimize and Auto Compaction (spark.databricks.delta.autoCompact.enabled, spark.databricks.delta.optimizeWrite.enabled) to keep file counts low.
- Partition carefully—too many partitions means more directory listing. Evaluate whether you can reduce partition granularity.
Examine Workflow Frequency
- Rapidly triggered streaming/batch jobs repeatedly list the same directories. If possible, reduce trigger frequency or batch intervals, or consolidate ingestion into fewer but larger micro-batches.
Check Delta Retention & History
- A large delta_log directory can cause frequent metadata checks. Make sure you’re vacuuming older versions if you don’t need long retention for time travel.
- Avoid “churn” in high-partition tables (dropping/adding partitions excessively).
Tune for Fewer Metadata Operations
- Confirm spark.databricks.io.cache.enabled = true on all relevant clusters and check if caching is actually being utilized (e.g., repeated queries on the same data).
- If possible, read data from Delta’s manifest files (for external systems) rather than listing directories, though this is more advanced.

So, have you identified which tables or queries are the main sources of repeated LIST calls? Are there any pipelines or watchers outside Databricks also hitting these same GCS paths?

By first pinpointing and correlating where these calls originate (jobs, tables, intervals) and then tuning how often and how they list files, you should see a larger reduction in GCS Class A operations.

minhhung0507 · ‎01-03-2025

Hi @VZLA ,

Thank you for your detailed suggestions and guidance.

Here are some updates and points regarding our current setup:

The primary workloads/jobs we use are streaming jobs running on Databricks.
We’ve already implemented Auto Optimize and Auto Compaction. Instead of partitions, we use Clustering for our data, which we believe is a viable alternative.
Currently, we have approximately 300 jobs triggered every minute.
Data retention is set to around 2 days, and we regularly perform vacuuming and optimization tasks.
The setting spark.databricks.io.cache.enabled = true is also enabled across all relevant clusters.
Yes, we do have some large tables with significant amounts of data, and we suspect these are contributing to the frequent directory listing operations. Additionally, there are no pipelines or external watchers outside Databricks writing to GCS.

Are you thinking the problem is 300 jobs being triggered continuously?

Regards,
Hung Nguyen

VZLA · ‎01-03-2025

@minhhung0507 its hard to say without having more direct insight, but generally speaking many streaming jobs with very frequent intervals will likely contribute; 300 jobs triggered continously will also contribute depending on the use case of these jobs, are all these all the same? Are these Spark Jobs, or "Jobs" in some other context? can these be consolidated?

Consider generating a histogram of metadata calls per job and prioritizing the sort in descending order of frequency, per job. Streaming jobs, especially with high trigger rates, often contribute significantly to the number of metadata operations, driving up GCS costs. Additionally, analyze the distribution of storage files by creating a histogram of files per directory and file sizes. Pay close attention to the _delta_log, checkpoint, and version files, as excessive file counts in these areas can escalate metadata operations. A 2-day retention policy combined with frequent vacuum and optimize tasks may further amplify LIST and GET calls.

Key recommendations include:

Consolidate jobs: Assess if 300 jobs per minute can be combined into fewer, larger jobs.
Identify and optimize queries that contribute most to metadata operations.
Avoid full table scans where possible.
For streaming jobs, use larger, less frequent batches instead of high-frequency triggers, as near-real-time streams are major contributors to metadata calls.

minhhung0507 · ‎01-03-2025

Dear @VZLA ,

Thank you so much for your detailed insights and recommendations!

We truly appreciate your suggestion about visualizing and analyzing metadata operations—it’s an excellent idea that we can definitely apply to identify and prioritize optimizations. Additionally, avoiding full table scans whenever possible is another valuable approach we’ll consider.

We will review our entire pipeline thoroughly, apply your recommendations, and continue monitoring the system to ensure improvements.

Thanks again for your support!

Regards,
Hung Nguyen