03-06-2026 08:53 AM
Hi @Malthe,
Sorry, I should have been more precise with the terminology. You are right: it is not reading those terabytes of data from the storage layer into the CPU.
Your snapshot shows bytes pruned at around 20 terabytes. Although the cluster didn't read the data content of those 20 terabytes, it still had to spend time (task time) processing the metadata for approximately 260K pruned files. Reading that metadata takes time. Then there is the time to read the other 50K files. Even if it only took a few hundred kilobytes from each on average to reach that 28 GB total, the overhead of opening 50K separate cloud storage connections is significant. I believe the time you describe as unaccounted for is spent here.
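To put rough numbers on that overhead, here is a minimal back-of-the-envelope sketch in Python. The file counts come from your snapshot. The per-file latencies (5 ms per metadata check, 100 ms per file open) are purely hypothetical placeholders that I picked for illustration, not measured values, so please substitute whatever your environment actually shows.

```python
def estimate_file_overhead_seconds(pruned_files, read_files,
                                   metadata_ms_per_file, open_ms_per_file):
    """Estimate cumulative task time (in seconds) spent on per-file overhead.

    This ignores the time to transfer the data itself; it only counts the
    fixed cost paid once per file.
    """
    metadata_ms = pruned_files * metadata_ms_per_file  # checking stats of pruned files
    open_ms = read_files * open_ms_per_file            # opening each file that is read
    return (metadata_ms + open_ms) / 1000.0


# File counts taken from the snapshot discussed above.
PRUNED_FILES = 260_000
READ_FILES = 50_000

# ASSUMPTION: hypothetical per-file latencies, for illustration only.
total_seconds = estimate_file_overhead_seconds(
    PRUNED_FILES, READ_FILES, metadata_ms_per_file=5, open_ms_per_file=100
)
print(f"Estimated per-file overhead: {total_seconds / 3600:.2f} hours of task time")
```

Note that this is cumulative task time summed across all tasks, not wall-clock time, since the work is spread over the cluster's cores. The point is only that small per-file costs multiplied by hundreds of thousands of files can plausibly add up to hours.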
I appreciate what you are saying about the UI not being explicit about this. Having done some research, I think it is worth going a level deeper than the summary table. Could you try the steps below? This may help find out exactly where those hours went.
- Click the Merge Into... box in the center of your screen.
- Scroll down to the Task Metrics table.
- Look for something like Scheduler Delay and Metadata Operations.
If Scheduler Delay is high, the time was spent waiting for the cluster to coordinate the 50,000 file reads.
If Task Time is high but Data Read is low, the time was spent opening file headers to check stats.
Let me know how that goes.
If this answer resolves your question, could you mark it as “Accept as Solution”? That helps other users quickly find the correct fix.
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***