What is "ExecuteGrpcResponseSender: Deadline reached, shutting down stream"

Brad
Contributor II

 

Hi, 

I have a Delta table that is loaded by a structured streaming job. When I read this Delta table as a stream and run a MERGE inside foreachBatch, I sometimes see a long gap between the stream starting and the MERGE starting to run; it seems Spark is waiting for something. From the log I can see:

INFO ExecuteGrpcResponseSender: Starting for opId=5ef071b7-xxx, reattachable=true, lastConsumedStreamIndex=0
...
INFO SessionHolder: Session SessionKey(69xxx,04470efa-xxxx) accessed, time 1728792222507.
...
INFO ExecuteGrpcResponseSender: Deadline reached, shutting down stream for opId=5ef071b7-xxx after index 0. totalTime=120001284340ns waitingForResults=120001197790ns waitingForSend=0ns
INFO SessionHolder: Session SessionKey(69xxx,04470efa-xxxx) accessed, time 1728792342527.
INFO ExecuteGrpcResponseSender: Starting for opId=5ef071b7-xxx, reattachable=true, lastConsumedStreamIndex=0
...

 

 

There are many "INFO ExecuteGrpcResponseSender: Deadline reached, shutting down stream..." messages, and it seems something is timing out after 120s. I tried to set:

spark.network.timeout: 800s
spark.streaming.backpressure.enabled: true

but I can still see those deadline messages in the log.
What is happening here? Is there a config I can set to avoid this, since it seems to slow down the job?
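
For reference, the job is shaped roughly like this (a minimal sketch; table names, the key column, and the checkpoint path are placeholders, and the real merge logic is omitted):

# Minimal sketch of the job shape (placeholder names; real logic omitted).
from delta.tables import DeltaTable

def merge_batch(batch_df, batch_id):
    # MERGE the micro-batch into the target Delta table.
    target = DeltaTable.forName(batch_df.sparkSession, "target_table")
    (target.alias("t")
        .merge(batch_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .table("source_table")
    .writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation", "/tmp/checkpoints/merge_job")  # placeholder
    .start())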

Thanks

3 REPLIES

NandiniN
Databricks Employee

We need to understand why the upstream of the REPL cancelled the request. It could be resource exhaustion. Do you see "java.lang.OutOfMemoryError"?

I have seen https://issues.apache.org/jira/browse/SPARK-49492 be the cause of such an error in a past issue.

Do you see this issue regularly, or is it intermittent? Restarting the cluster will mitigate the issue, but to review the logs you may have to enable cluster log delivery to investigate further.
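
For reference, cluster log delivery is configured on the cluster spec via its cluster_log_conf field; a minimal sketch of that fragment (the destination path is a placeholder):

# Minimal sketch: the cluster_log_conf fragment of a Clusters API cluster
# spec, which enables cluster log delivery. The destination is a placeholder.
cluster_spec_fragment = {
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-logs/my-cluster"}
    }
}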

Brad
Contributor II

This might be a bug. The issue is gone if I change the cluster from shared mode to single-user mode.

NandiniN
Databricks Employee

It may not necessarily be a bug, but something that needs tuning due to architectural differences.

What the message says is:

  • The system was processing a gRPC operation identified by opId=5ef071b7-xxx, and it set a deadline for that operation (likely 120 seconds).
  • The operation didn't complete in time and exceeded the deadline, so the system has shut down the stream and stopped waiting for further results.
  • The operation spent almost all of its time (around 120 seconds) waiting for results and spent no time sending data back to the client; the nanosecond timings in the log confirm this, as shown in the sketch after this list.
  • It is an INFO message, indicating an event.
  • Shared mode uses Spark Connect under the hood, but single-user mode does not, which is why you do not see these logs on a single-user cluster.
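
A quick sanity check on those timings, which are in nanoseconds:

# Quick check: converting the log's nanosecond timings to seconds shows the
# stream hit the ~120 s deadline almost entirely while waiting for results.
total_time_ns = 120_001_284_340           # totalTime from the log line
waiting_for_results_ns = 120_001_197_790  # waitingForResults from the log line

print(total_time_ns / 1e9)            # ~120.0 seconds total
print(waiting_for_results_ns / 1e9)   # ~120.0 seconds waiting for results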

However, as our next steps:

 

  • We can try to understand the cause of the delay on the external resource or service the request is sent to.
  • It is possible the timeouts are not the same, or need to be bumped up; a hedged example follows this list.
  • Are there any other errors that you see along with these messages?
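
One possibility, as an assumption to verify against your Spark version rather than a confirmed fix: in open-source Spark Connect the sender's stream deadline defaults to 2 minutes, which matches the ~120 s in your log, and is controlled by a server-side setting that could be raised in the cluster's Spark config. Note this would only lengthen the window before the stream is cycled; it would not speed up a MERGE that is genuinely waiting for results.

spark.connect.execute.reattachable.senderMaxStreamDuration: 5m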

 

However, yes, it may need a more in-depth look.
