3 weeks ago
The following serverless query has an aggregated task time of 1.94h, but each of the two executions runs in an aggregated time of just a few minutes.
How is one supposed to make sense of that?
3 weeks ago
Hi, this is interesting. However, I noted this definition for aggregated task time:
Combined time it took to execute the query across all cores of all nodes. It can be significantly longer than the wall-clock duration if multiple tasks executed in parallel. It can be shorter than the wall-clock duration if tasks waited for available nodes.
So basically, the aggregated task time is not the execution time but the sum of all individual task times. For example: if you have 5 tasks that each ran for 1 min in parallel, the execution time might be 1 min but the aggregated task time would be 5 min.
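The distinction can be sketched in a few lines (a toy illustration, not Databricks-specific):

```python
# Toy illustration of aggregated task time vs. wall-clock execution time.
# Five tasks, each taking 60 seconds, all running in parallel on separate cores.
task_durations_s = [60, 60, 60, 60, 60]

aggregated_task_time_s = sum(task_durations_s)  # total effort across all cores
wall_clock_s = max(task_durations_s)            # what the user actually waits for

print(aggregated_task_time_s)  # 300 seconds -> "5 min" aggregated task time
print(wall_clock_s)            # 60 seconds  -> "1 min" execution time
```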
3 weeks ago
This query resulted in two executions. I have highlighted one in the screenshot, where it says that the time there is also aggregated (across cores). So how can two executions, each with an aggregated time in minutes, add up to almost two hours?
From the system billing table the query cost about 8 DBU, which are basically unaccounted for.
3 weeks ago
Hi @Malthe,
The discrepancy you're seeing between Aggregated Task Time (1.94h) and the actual Execution Time (~2.45m) is actually a sign of healthy parallelism, not an error. Here is how to interpret those two numbers:
The gap likely looks large because there is a significant amount of work happening. If you look at the IO metrics, the rows read is around 9 billion, spread across approximately 50K files. To process that many rows and scan around 20 terabytes of data in under 3 minutes, your serverless warehouse distributed the load across a large number of concurrent tasks. The 1.94 hours is simply the total effort of all those parallel workers added together.
On a positive note, this is a great sign: it shows that almost all the heavy lifting was handled by the high-performance C++ engine rather than the standard Spark JVM.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
3 weeks ago
@Ashwin_DSA the job does not scan terabytes of data, all of that is pruned away (see the screenshot); it actually reads just 28 GB of data.
But again, the time is not accounted for, despite the user interface saying that the executions are shown in aggregated time as well. It just doesn't match. And there's also no way to debug this; you simply can't see where the time is spent.
3 weeks ago
Hi @Malthe,
Sorry, I should have been a bit more precise with the terminology. You are right: it is not reading those terabytes of data from the storage layer into the CPU.
Your snapshot shows bytes pruned was around 20 terabytes. Although it didn't actually read the data content of those 20 terabytes, the cluster still had to spend time (task time) to process the metadata for approximately 260K pruned files. Reading that metadata does take time. And then there is the time to read the other 50K files. Even if it only took a few kilobytes from each to reach that 28 GB total, the overhead of opening 50K separate cloud storage connections is significant. I believe the time that you are referring to as unaccounted for is spent here.
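To get a feel for the scale, here is a back-of-the-envelope sketch. The per-file overhead figure is an assumption chosen purely for illustration; the actual cost per file is not published and will vary:

```python
# Illustrative only: how small per-file overheads can sum to hours of
# aggregated task time. The ~22.5 ms figure is an assumed value, not measured.
pruned_files = 260_000          # files whose metadata was inspected to prune them
read_files = 50_000             # files actually opened and read
per_file_overhead_s = 0.0225    # assumed ~22.5 ms of metadata/open cost per file

total_overhead_s = (pruned_files + read_files) * per_file_overhead_s
print(total_overhead_s / 3600)  # roughly 1.94 core-hours of aggregated task time
```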
I appreciate what you are saying about the UI not being explicit here. Having done some research: can you go a level deeper than the summary table? That may help pinpoint exactly where those hours went.
If Scheduler Delay is high, the time was spent waiting for the cluster to coordinate the 50,000 file reads.
If Task Time is high but Data Read is low, the time was spent opening file headers to check stats.
Let me know how that goes.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
3 weeks ago
@Ashwin_DSA there are no such metrics available here:
Also interesting is that the other execution just has 0's for every single value. No time spent, 0 for everything.
I might be able to get more detailed information if writing first to a staging table, then doing the merge. I believe merge operations hide most of the query execution details currently.
3 weeks ago
Thanks, @Malthe, for checking that.
Since I can't replicate your exact environment, I can only assume that the missing time and the associated DBU cost are probably due to parallelism.
The fact that Photon accounts for 95% of the task time reinforces that it's performing vectorised compute across a lot of data... and that the cluster is simply doing a large amount of work in parallel.
With regards to your earlier comment about the DBU cost, which is unaccounted for.... DBUs are a measure of compute used over time (roughly: cores × time × SKU factor). So if your query runs for a few minutes on a serverless cluster with multiple cores, it's completely plausible to see something like 8 DBUs, even though:
In other words, those 8 DBUs are accounted for... they correspond to the parallel work the cluster did. It's just expressed in a different unit. If you want to sanity-check further, compare the bytes/rows processed in the query profile with the DBUs in the billing table for the same statement ID. If it still looks off, it's worth opening a support ticket so the internal metrics for that specific run can be reviewed.
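As a rough sketch of that "compute over time" formula (every number here is an assumption for illustration, not actual Databricks pricing):

```python
def estimate_dbus(cores: int, hours: float, dbu_per_core_hour: float) -> float:
    """Rough DBU estimate: compute capacity x time x SKU rate (all assumed)."""
    return cores * hours * dbu_per_core_hour

# Hypothetical: a serverless warehouse averaging 24 cores for 5 minutes of
# wall-clock time, at an assumed rate of 4 DBU per core-hour.
print(estimate_dbus(cores=24, hours=5 / 60, dbu_per_core_hour=4))  # 8.0
```

The point is only that a few minutes of wall-clock time on a wide cluster can plausibly land in the single-digit-DBU range.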
Hope this helps.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
3 weeks ago
I understand the difference between wall-clock and aggregated time across cores, it's documented under query profiling.
According to the pricing docs, presently a DBU is roughly equivalent to a DS3 v2 instance with 4 cores. What I mean by unaccounted for is that we have a query that runs for a total aggregated time of 1.94h with a wall-clock runtime of 4-5 minutes. The math checks out because that's roughly 24 core hours and matches the expectations of 8 DBUs with the 4-core instance mentioned above.
The problem is that the execution breakdown accounts for just a fraction of that and it purports to be the complete picture. It's basically totally broken for this type of query.
I changed the processing now to first materialize to a staging table and then perform the merge. This reduces the runtime to 40 minutes, but the query details still lack any fidelity with the actual aggregated runtime:
Observations:
Lots of task types contribute no metrics at all:
In summary, there's basically no transparency here. The total cost of a query simply can't be traced back to its constituents in any meaningful way. That's a significant setback from classic compute where the Spark UI along with basic compute metrics did in fact account for the time spent and the resulting cost.
3 weeks ago
Hi @Malthe,
Thanks for sharing the extra snapshots and observations.
You're absolutely right that the core-hour math and the 8 DBUs line up. The issue you're hitting is that the query profile UI isn't a fully additive cost-accounting view, especially for complex DML and serverless.
A few clarifications based on how the system works today:
So from a billing perspective, the DBUs you're seeing are consistent with the aggregated task time. From a transparency perspective, your criticism is fair. The current query profile is optimised for finding bottlenecks, not yet for reconstructing every billed minute. Feedback like this is exactly what we're using to improve the serverless profiling and cost-attribution experience.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
2 weeks ago
Hi @Malthe,
There are two separate things going on here, and they are likely independent of each other. Let me break them down.
TASK TIME vs EXECUTION TIME DISCREPANCY
The gap between aggregated task time (1.39m) and total execution time (4.14m) is expected behavior and is not caused by the segfault. Aggregated task time only measures the wall-clock time that Spark tasks spend actively computing on executors. It does not include:
- Scheduler delay (time between task submission and actual execution)
- Serialization and deserialization of task results
- Shuffle read/write I/O wait time
- Driver-side orchestration overhead
- Spark Connect gRPC round-trip communication (this is especially relevant on serverless)
On serverless compute, all Spark interactions go through the Spark Connect API over gRPC. This adds a communication layer between your driver code and the Spark execution engine. For a MERGE operation inside foreachBatch, there are multiple planning, optimization, and coordination steps that happen on the driver side and over gRPC that do not count as "task time" but do count toward total execution duration.
Additionally, the MERGE operation itself has phases (reading the target table, computing the join, writing deletes/updates/inserts) where the time between phases is driver coordination, not active task execution.
You can verify this by looking at the Spark UI for that run. Under the SQL/DataFrame tab, you will see the query plan and can see time spent in different stages. The "Duration" column in the Stages tab will also show scheduling delay and other overhead separately from task compute time.
SEGMENTATION FAULT AND gRPC CHANNEL CLOSED ERRORS
The "Fatal Python error: Segmentation fault" combined with the StreamingQueryListenerBus gRPC "Channel closed" warnings points to an issue in the serverless runtime itself, not in your PySpark code. Since you confirmed you are using vanilla PySpark with no external libraries on environment version 5, this is a runtime-level crash.
The gRPC "Channel closed" messages are a symptom of the crash, not the cause. When the underlying process crashes (SIGSEGV), the Spark Connect gRPC channel is forcibly closed, which triggers those StreamingQueryListenerBus warnings.
A few things to check and try:
1. Trigger type: if you are using Trigger.Once, note that it has been deprecated since Databricks Runtime 11.3 LTS. Switch to Trigger.AvailableNow, which is the supported trigger for incremental batch processing on serverless.
https://docs.databricks.com/en/structured-streaming/triggers.html
2. Caching in foreachBatch: the Databricks documentation recommends caching the batch DataFrame before a MERGE to avoid reading the input multiple times. However, on serverless compute, DataFrame caching (df.cache(), spark.catalog.cacheTable()) is not supported and will throw exceptions. If your foreachBatch code includes any caching calls, remove them.
https://docs.databricks.com/en/structured-streaming/foreach.html
3. MERGE idempotency: make sure your MERGE statement inside foreachBatch is idempotent, as restarts can re-apply the same batch.
https://docs.databricks.com/en/structured-streaming/delta-lake.html
4. Batch size: if the MERGE is processing a very large batch, try limiting the input with maxBytesPerTrigger or maxFilesPerTrigger to reduce per-batch memory pressure. Heavy memory use during native Delta I/O operations can contribute to OS-level process termination that surfaces as SIGSEGV.
5. File a support ticket: since this is a native crash in the managed serverless runtime with vanilla PySpark, the root cause is most likely in the runtime itself. Open a support ticket and include the workspace URL, the job run URL, and the exact environment version. The engineering team can examine the crash dump and determine whether this is a known issue with a fix in a newer environment version.
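Putting points 1-4 together, a hedged sketch of what the foreachBatch setup might look like. All table names, paths, and columns below are hypothetical placeholders, and this assumes a Delta source; adapt it to your pipeline:

```python
from delta.tables import DeltaTable
from pyspark.sql import DataFrame

def upsert_batch(batch_df: DataFrame, batch_id: int) -> None:
    # Point 3: an idempotent MERGE keyed on a unique id, so re-applying the
    # same batch after a restart converges to the same state. Note there is
    # no df.cache() here, since caching is not supported on serverless (point 2).
    target = DeltaTable.forName(batch_df.sparkSession, "catalog.schema.target")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .option("maxFilesPerTrigger", 1000)   # point 4: cap per-batch input size
    .table("catalog.schema.source")
    .writeStream
    .foreachBatch(upsert_batch)
    .trigger(availableNow=True)           # point 1: replaces deprecated Trigger.Once
    .option("checkpointLocation", "/path/to/checkpoint")
    .start())
```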
The task time discrepancy is normal and unrelated to the crash. The segfault is the issue that needs attention, and a support ticket is the best path to resolution for a native crash on serverless.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.
2 weeks ago
@Ashwin_DSA by the way, "Open in Spark UI" is available only for SQL Warehouse; it's not available for serverless notebooks or jobs.
2 weeks ago
Hi @Malthe,
Good callout on Spark UI access. You are right. The "Open in Spark UI" link is only documented and supported for Databricks SQL queries on SQL warehouses, not for serverless notebooks or jobs. So for pure serverless jobs/notebooks, you don't currently have that fallback to the classic Spark UI, which makes the gaps in the query profile UI even more visible.
I agree this still falls short of the "every billed minute is traceable" experience you are used to from classic Spark UI, especially on serverless. Your feedback about making the serverless profiling view reconcile cleanly with DBUs (and exposing more of the hidden/grouped work) is completely valid and is being passed on to the product team.
If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.