<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Python segmentation fault in serverless job in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/149887#M53198</link>
    <description>&lt;P&gt;We're getting a Python segmentation fault in a serverless job that uses Delta Table merge inside a foreachBatch step in structured streaming (trigger once).&lt;/P&gt;</description>
    <pubDate>Thu, 05 Mar 2026 10:03:55 GMT</pubDate>
    <dc:creator>Malthe</dc:creator>
    <dc:date>2026-03-05T10:03:55Z</dc:date>
    <item>
      <title>Python segmentation fault in serverless job</title>
      <link>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/149887#M53198</link>
      <description>&lt;P&gt;We're getting a Python segmentation fault in a serverless job that uses Delta Table merge inside a foreachBatch step in structured streaming (trigger once).&lt;/P&gt;&lt;LI-CODE lang="java"&gt;/databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/streaming/query.py:479: UserWarning: StreamingQueryListenerBus Handler thread received exception, all client side listeners are removed and handler thread is terminated. The error is: &amp;lt;_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.CANCELLED
	details = "Channel closed!"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"Channel closed!", grpc_status:1, created_time:"2026-03-05T06:33:35.843237805+00:00"}"
&amp;gt;
  warnings.warn(
/databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/streaming/query.py:479: UserWarning: StreamingQueryListenerBus Handler thread received exception, all client side listeners are removed and handler thread is terminated. The error is: &amp;lt;_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.CANCELLED
	details = "Channel closed!"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2026-03-05T06:34:03.441994764+00:00", grpc_status:1, grpc_message:"Channel closed!"}"
&amp;gt;
  warnings.warn(
Fatal Python error: Segmentation fault&lt;/LI-CODE&gt;&lt;P&gt;&amp;nbsp;Looking at the serverless query UI, more time is spent in the task than can be accounted for (1.39m vs 4.14m).&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2026-03-05 at 11.01.39.png" style="width: 999px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/24579i7B5570053DC1C571/image-size/large?v=v2&amp;amp;px=999" role="button" title="Screenshot 2026-03-05 at 11.01.39.png" alt="Screenshot 2026-03-05 at 11.01.39.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;Could this be related to the segmentation fault? We have lots of warnings and errors in the logs, but that's par for the course with Databricks in general.&lt;/P&gt;</description>
      <pubDate>Thu, 05 Mar 2026 10:03:55 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/149887#M53198</guid>
      <dc:creator>Malthe</dc:creator>
      <dc:date>2026-03-05T10:03:55Z</dc:date>
    </item>
    <item>
      <title>Re: Python segmentation fault in serverless job</title>
      <link>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/150004#M53216</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9268"&gt;@Malthe&lt;/a&gt;&amp;nbsp;, this is hard to troubleshoot without the code, but I did some digging on my end and have a few suggestions for you to consider.&lt;/P&gt;
&lt;P&gt;The segfault (SIGSEGV) means the Python process crashed in native code -- CPython, a C/C++/Rust library, or JVM/native integration. This is not a normal Spark or Python exception. It's almost never caused by user Python logic alone.&lt;/P&gt;
&lt;P&gt;The gRPC "Channel closed" / StatusCode.CANCELLED messages are a symptom, not the root cause. When the driver or worker process crashes, the client-side gRPC channel closes and you get those warnings.&lt;/P&gt;
&lt;P&gt;The task timing (1.39m accounted for vs 4.14m for the overall query) just tells you the MERGE is heavy. That alone doesn't explain a segfault, but heavy memory and CPU pressure during native I/O or Delta operations can destabilize native code paths and end in a crash.&lt;/P&gt;
&lt;P&gt;Here's what I'd recommend for a case like this on Databricks serverless.&lt;/P&gt;
&lt;P&gt;First, minimize the repro. Strip the foreachBatch down to just the MERGE with a small, fixed input sample. If that still segfaults, you have a compact repro to hand off to Databricks support.&lt;/P&gt;
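&lt;P&gt;To make that concrete, here's a hedged sketch of what a stripped-down foreachBatch body might look like. The table and column names (main.default.target_table, id) and the merge_batch/make_merge_sql names are placeholders, not from the original post. One detail worth noting: under Spark Connect (which serverless uses exclusively), resolve the session from the micro-batch DataFrame rather than a global SparkSession:&lt;/P&gt;

```python
# Hypothetical minimal repro: foreachBatch reduced to a single MERGE.
# All table/column names below are placeholders.

def make_merge_sql(target_table: str, source_view: str) -> str:
    """Build the MERGE statement run inside foreachBatch."""
    return (
        f"MERGE INTO {target_table} AS t "
        f"USING {source_view} AS s "
        "ON t.id = s.id "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

def merge_batch(batch_df, batch_id):
    # Under Spark Connect (serverless), use the session attached to the
    # micro-batch DataFrame instead of a global SparkSession.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql(
        make_merge_sql("main.default.target_table", "updates")
    )
```

If this stripped-down version still segfaults on a small fixed sample, it's exactly the compact repro support will ask for.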
&lt;P&gt;Second, check for patterns that commonly trigger native crashes: very wide rows or huge batch sizes in the MERGE target or source, non-standard or heavy native Python libraries in the foreachBatch that may not be fully compatible with the serverless runtime, or extremely large broadcast joins and skewed data driving single executors to extreme memory use.&lt;/P&gt;
&lt;P&gt;Third, try configuration workarounds where possible. If you're on a Serverless SQL Warehouse with Python UDFs, test the same logic on a Pro or classic cluster to isolate whether the issue is serverless-specific. You can also reduce per-task pressure -- smaller input batches, fewer partitions per foreachBatch run, a simpler MERGE (fewer WHEN MATCHED clauses, fewer columns) -- as a diagnostic step.&lt;/P&gt;
&lt;P&gt;Fourth, gather artifacts for support: the exact workspace URL of the failed run and job/warehouse, the minimal foreachBatch/MERGE code, and approximate input sizes (rows, GB) along with table schema characteristics like very wide columns or large VARCHAR/BINARY fields.&lt;/P&gt;
&lt;P&gt;Worth flagging -- given this is a native crash in a managed serverless runtime, the real fix usually comes down to one of two paths: adjusting workload shape to avoid a specific stress pattern, or getting a Databricks runtime/serverless bug investigated and addressed via support with a minimal repro.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps, Louis.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2026 14:17:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/150004#M53216</guid>
      <dc:creator>Louis_Frolio</dc:creator>
      <dc:date>2026-03-06T14:17:11Z</dc:date>
    </item>
    <item>
      <title>Re: Python segmentation fault in serverless job</title>
      <link>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/150011#M53221</link>
      <description>&lt;P&gt;We're not using any external libraries; this is just vanilla PySpark running on the latest serverless runtime (environment version 5). The segmentation fault must come from some of the Databricks software that powers the serverless platform, perhaps telemetry.&lt;/P&gt;</description>
      <pubDate>Fri, 06 Mar 2026 15:48:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/150011#M53221</guid>
      <dc:creator>Malthe</dc:creator>
      <dc:date>2026-03-06T15:48:02Z</dc:date>
    </item>
    <item>
      <title>Hi @Malthe, Since you have confirmed this is vanilla PySp...</title>
      <link>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/150359#M53394</link>
      <description>&lt;P&gt;Hi &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/9268"&gt;@Malthe&lt;/a&gt;,&lt;/P&gt;
&lt;P&gt;Since you have confirmed this is vanilla PySpark with no external libraries on serverless runtime environment version 5, this narrows things down considerably. Here are some additional observations and recommendations beyond what Louis shared.&lt;/P&gt;
&lt;P&gt;WHAT THE STACK TRACE TELLS US&lt;/P&gt;
&lt;P&gt;The crash path is inside pyspark/sql/connect/streaming/query.py at line 479, which is the StreamingQueryListenerBus handler thread. Serverless compute runs exclusively through Spark Connect (the gRPC-based protocol), and the "Channel closed" / StatusCode.CANCELLED messages indicate that the server-side Spark session terminated (or the gRPC channel was recycled) while the client-side listener bus thread was still active. The segfault then occurs when the Python process tries to access memory through a now-invalid gRPC channel reference.&lt;/P&gt;
&lt;P&gt;In other words, this looks like a timing issue in the Spark Connect streaming listener cleanup sequence, not a problem with your MERGE logic itself. The MERGE completes successfully on the server side, but the client-side Python process crashes during teardown.&lt;/P&gt;
&lt;P&gt;THE TASK TIMING GAP&lt;/P&gt;
&lt;P&gt;The discrepancy you see (1.39m accounted for vs 4.14m total) is consistent with this theory. The "unaccounted" time likely includes:&lt;/P&gt;
&lt;P&gt;1. The Spark Connect session setup and teardown overhead&lt;BR /&gt;
2. The listener bus polling interval before the crash&lt;BR /&gt;
3. Any retry/backoff in the gRPC layer before the fatal signal&lt;/P&gt;
&lt;P&gt;This gap alone does not indicate a problem with your MERGE performance.&lt;/P&gt;
&lt;P&gt;RECOMMENDED NEXT STEPS&lt;/P&gt;
&lt;P&gt;1. Check whether the job actually succeeds despite the segfault. In many Spark Connect streaming teardown crashes, the data is written correctly and the checkpoint is committed before the Python process crashes. Verify your target Delta table has the expected data after the run. If the data is correct, the segfault is happening during cleanup, not during the write.&lt;/P&gt;
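&lt;P&gt;For step 1, one quick check is to compare the keys the batch was supposed to write against what actually landed in the target table. A minimal sketch (the helper name and key sets are illustrative; in practice you'd build the actual set from something like collecting the key column of the target table):&lt;/P&gt;

```python
# Hypothetical verification helper: find keys the batch should have
# written but that are absent from the target table.

def missing_keys(expected, actual):
    """Return sorted keys present in `expected` but not in `actual`."""
    return sorted(set(expected) - set(actual))

# Illustrative key sets, not real data:
expected = {101, 102, 103}
actual = {101, 103}
print(missing_keys(expected, actual))  # -> [102]
```

If this comes back empty after a "failed" run, the write committed and the segfault happened during teardown.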
&lt;P&gt;2. Try wrapping your streaming query with explicit lifecycle management to give the listener bus time to shut down cleanly:&lt;/P&gt;
&lt;PRE&gt;import time

query = (
  df.writeStream
  .foreachBatch(merge_function)
  .trigger(availableNow=True)
  .option("checkpointLocation", checkpoint_path)
  .start()
)
query.awaitTermination()

# Add a brief pause before the Python process exits
time.sleep(5)&lt;/PRE&gt;
&lt;P&gt;The sleep gives the background gRPC threads time to close gracefully before Python tears down the process.&lt;/P&gt;
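&lt;P&gt;If you also run background threads of your own inside the job, a bounded shutdown is generally nicer than a fixed sleep. This is a plain-Python pattern, not a Spark Connect API; the listener loop here is a stand-in for any polling thread you control:&lt;/P&gt;

```python
import atexit
import threading

# Sketch: orderly, bounded teardown of a background handler thread.
stop = threading.Event()

def listener_loop():
    # Wake every 50 ms; exit as soon as shutdown is requested.
    while not stop.wait(0.05):
        pass  # poll for events here

worker = threading.Thread(target=listener_loop, daemon=True)
worker.start()

def shutdown():
    stop.set()              # ask the thread to exit its loop
    worker.join(timeout=5)  # wait at most 5 s instead of sleeping blindly

atexit.register(shutdown)   # run teardown when the interpreter exits
```

The join with a timeout gives the same grace period as the sleep, but returns as soon as the thread has actually finished.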
&lt;P&gt;3. If you are running this as a Databricks Job task, check whether the task is marked as failed or succeeded in the job run history. If the task succeeds despite the segfault warning in stdout, the crash is cosmetic (happening after the streaming query has already finished).&lt;/P&gt;
&lt;P&gt;4. File a support ticket. Since this is a crash in Databricks-managed native code on the serverless runtime with no user-installed libraries, the Databricks engineering team is best positioned to investigate and address it. Include:&lt;/P&gt;
&lt;PRE&gt; - The workspace URL and job run URL
 - The full driver logs from the failed run
 - The serverless environment version (v5 as you noted)
 - Approximate data volumes (rows and size of the source micro-batch, and the size of the target Delta table)&lt;/PRE&gt;
&lt;P&gt;DOCUMENTATION REFERENCES&lt;/P&gt;
&lt;P&gt;- Serverless compute limitations (note: only Spark Connect APIs are supported, and only Trigger.AvailableNow is supported for streaming):&lt;/P&gt;
&lt;PRE&gt;https://docs.databricks.com/en/compute/serverless/limitations.html&lt;/PRE&gt;
&lt;P&gt;- foreachBatch with Delta Lake merge in structured streaming:&lt;/P&gt;
&lt;PRE&gt;https://docs.databricks.com/en/structured-streaming/foreach.html&lt;/PRE&gt;
&lt;P&gt;- Structured streaming production best practices:&lt;/P&gt;
&lt;PRE&gt;https://docs.databricks.com/en/structured-streaming/production.html&lt;/PRE&gt;
&lt;P&gt;* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.&lt;/P&gt;
&lt;P&gt;If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2026 05:59:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/python-segmentation-fault-in-serverless-job/m-p/150359#M53394</guid>
      <dc:creator>SteveOstrowski</dc:creator>
      <dc:date>2026-03-09T05:59:52Z</dc:date>
    </item>
  </channel>
</rss>

