Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Slow batch processing in Databricks job due to high deletion vector and unified cache overhead

minhhung0507
Valued Contributor

We have a Databricks pipeline in which a processing layer reads from several Silver tables to detect PK/FK changes and trigger updates to Gold tables. Normally, this near-real-time job has a latency of about 3 minutes per micro-batch.
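
For context, the job follows roughly this pattern (a minimal sketch only; table names, the join key, and the checkpoint path are hypothetical placeholders, and the real change-detection logic is more involved):

```python
# Minimal sketch of a CDF-driven Silver -> Gold micro-batch job.
# Table names, join key, and checkpoint path are hypothetical placeholders;
# change data feed is assumed to be enabled on the Silver table.
from delta.tables import DeltaTable

def upsert_to_gold(microbatch_df, batch_id):
    # Keep inserted/updated rows and de-duplicate per primary key within the batch.
    changes = (microbatch_df
               .filter("_change_type IN ('insert', 'update_postimage')")
               .dropDuplicates(["pk_id"]))

    gold = DeltaTable.forName(microbatch_df.sparkSession, "catalog.gold.dim_customer")
    (gold.alias("t")
         .merge(changes.alias("s"), "t.pk_id = s.pk_id")
         .whenMatchedUpdateAll()
         .whenNotMatchedInsertAll()
         .execute())

(spark.readStream
      .option("readChangeFeed", "true")        # stream the Silver table's CDF
      .table("catalog.silver.customer")
      .writeStream
      .foreachBatch(upsert_to_gold)
      .option("checkpointLocation", "/tmp/checkpoints/gold_dim_customer")
      .trigger(processingTime="1 minute")      # near-real-time micro-batches
      .start())
```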

Recently, we noticed that each batch is running much slower (10–20+ minutes), even after scaling both the driver and the executors from N2-8 to N2-16, with no noticeable improvement.

I extracted job metrics and sorted them by AvgCompletionTime. The top bottlenecks are:

  • Deletion vector operations (number of deletion vector rows read, time spent reading deletion vectors, etc.) with ~6,100–6,300 seconds average completion time per task.

  • Unified cache operations (unified cache populate time for ParquetFooter, unified cache populate time for RangeDeletionVector, unified cache read/serve bytes...) with ~5,000 seconds average completion time per task.

From the metrics, it seems that:

  • A lot of tasks are reading from disk cache (unified cache) but also spending significant time populating the cache.

  • Many Parquet footers, column chunks, and deletion vectors are being cached on the fly.

  • Cache hit ratio may be low due to constantly reading new data from CDF.

My questions are:

  1. In a near real-time workload where data changes frequently, could unified cache cause more overhead than benefit (due to low hit rate)?

  2. Are there recommended strategies to reduce deletion vector overhead in Delta tables (e.g., optimize, vacuum, rewrite)?
  3. Any tuning tips for improving batch performance in this case?

Would appreciate any guidance or best practices from the community.
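
On question 1, one way to test the low-hit-rate hypothesis is to rerun the same job with the Databricks disk cache disabled and compare batch durations against a baseline run. This is only a sketch: it assumes the "unified cache" counters above are governed by the standard disk cache setting, which may not hold on every DBR version.

```python
# Hedged experiment: rerun the micro-batch job with the disk cache disabled
# and compare per-batch latency against the baseline run.
# Assumption: the "unified cache" metrics are influenced by this flag;
# verify against the DBR version in use.

# Show the current setting (falls back to "unset" if not explicitly configured).
print(spark.conf.get("spark.databricks.io.cache.enabled", "unset"))

# Disable the disk cache for this cluster session only.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# After rerunning the pipeline, compare micro-batch timings, e.g. from the
# streaming query's progress events:
for q in spark.streams.active:
    progress = q.lastProgress
    if progress:
        print(progress["batchId"], progress["durationMs"], progress["numInputRows"])
```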

Metrics table:

 

MetricName | TotalMetricValue | TotalNumberTasks | AvgCompletionTime
number of non-local (rescheduled) scan tasks10109531.9
unified cache hits count for RangeDeletionVector162448141376274.97798742138
number of deletion vector rows read from disk cache1710803141546138.35435435436
number of deletion vector rows read1710803141546138.35435435436
number of deletion vectors read162794141546138.35435435436
number of deletion vectors read from disk cache162794141546138.35435435436
internal.metrics.resultSerializationTime14227925848.54285714286
unified cache coalesce count for RangeDeletionVector24225545409.65979381443
time spent reading deletion vectors636567154841087835310.43413173653
size of deletion vectors read from disk cache86108131087835310.43413173653
size of deletion vectors read86108131087835310.43413173653
time spent waiting for fetched deletion vectors01087835310.43413173653
size of deletion vectors read from memory cache01087835310.43413173653
unified cache coalesce count for ParquetFooter6711135191.16279069768
unified cache hits count for ParquetFooter6822851021555115.15193026152
unified cache hits count for ParquetColumnChunk24182751021655109.88888888889
scan time40631191068145071.2162818955
number of input batches16627571068145071.2162818955
unified cache populate time for ParquetFooter4857124575121097445009.97760617761
uncompressed bytes read after filtering2628582903091097445009.97760617761
cache hits size (uncompressed)2611501073871097445009.97760617761
cache hits size1982504351651097445009.97760617761
stable cache serve bytes for ParquetColumnChunk1874847423741097445009.97760617761
unified cache serve bytes for ParquetColumnChunk1874847423741097445009.97760617761
unified cache read bytes for ParquetColumnChunk1847457669451097445009.97760617761
unified cache populate time for RangeDeletionVector213166923331097445009.97760617761
unified cache serve bytes for ParquetFooter118430891231097445009.97760617761
stable cache serve bytes for ParquetFooter118430891231097445009.97760617761
unified cache read bytes for ParquetFooter117708279581097445009.97760617761
Regards,
Hung Nguyen
2 REPLIES

noorbasha534
Valued Contributor II

Hello @minhhung0507, trust all is well at your end. May I know how you captured these job metrics?

noorbasha534
Valued Contributor II

@minhhung0507 as per the documentation:

'The actual physical removal of deleted rows (the "hard delete") is deferred until the table is optimized with OPTIMIZE or when a VACUUM operation is run, cleaning up old files.'

So, based on this, try running OPTIMIZE on the table once the data load finishes and verify the metrics in the next run.
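
To make that concrete, below is a minimal sketch of such a maintenance pass. The table name is a placeholder; the REORG ... APPLY (PURGE) step and the deletion-vector table property are additional options beyond the quoted documentation, and retention settings should follow your own policies.

```python
# Hypothetical maintenance pass to run after the data load finishes.
# "catalog.silver.customer" is a placeholder; apply to the affected tables.

# Compact small files written by the streaming job.
spark.sql("OPTIMIZE catalog.silver.customer")

# Rewrite data files so that rows soft-deleted via deletion vectors are
# physically removed (the "hard delete").
spark.sql("REORG TABLE catalog.silver.customer APPLY (PURGE)")

# Clean up files that are no longer referenced (uses the default retention
# window unless configured otherwise).
spark.sql("VACUUM catalog.silver.customer")

# Optional experiment: if deletion-vector overhead remains dominant, the
# feature can be turned off for new writes on a per-table basis.
# spark.sql("""
#   ALTER TABLE catalog.silver.customer
#   SET TBLPROPERTIES ('delta.enableDeletionVectors' = false)
# """)
```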