Re: Python segmentation fault in serverless job

Louis_Frolio · 03-06-2026

@Malthe, this is hard to troubleshoot without the code, but I did some digging on my end and have a few suggestions for you to consider.

The segfault (SIGSEGV) means the Python process crashed in native code -- CPython, a C/C++/Rust library, or JVM/native integration. This is not a normal Spark or Python exception. It's almost never caused by user Python logic alone.
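To make the "crashed in native code" point concrete, here is a minimal sketch using only the Python standard library. `describe_exit` and `enable_crash_tracebacks` are names I made up for illustration. Whether the stderr output of a serverless worker actually reaches your job logs is an assumption you would need to verify.

```python
import faulthandler
import signal
import sys


def describe_exit(returncode: int) -> str:
    """Turn a subprocess-style return code into a readable label."""
    if returncode < 0:
        # On POSIX, subprocess reports "killed by signal N" as -N.
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return f"exit code {returncode}"


def enable_crash_tracebacks() -> bool:
    """Ask CPython to dump Python-level tracebacks on fatal signals.

    This does not prevent the crash; it only shows which Python frame
    was active when the native code faulted.
    """
    faulthandler.enable(file=sys.stderr, all_threads=True)
    return faulthandler.is_enabled()
```

Calling `enable_crash_tracebacks()` at the top of the function you pass to foreachBatch is one way to try it. If a traceback shows up, it usually points at the library call that was running when the process died.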

The gRPC "Channel closed" / StatusCode.CANCELLED messages are a symptom, not the root cause. When the driver or worker process crashes, the client-side gRPC channel closes and you get those warnings.

The task timing (MERGE at ~13.9m vs 4.14m for the overall query) just tells you the MERGE is heavy. That alone doesn't explain a segfault, but heavy memory and CPU pressure during native I/O or Delta operations can push native code into failure paths that surface as SIGSEGV. (An outright out-of-memory kill by the OS would normally show up as SIGKILL instead.)

Here's what I'd recommend for a case like this on Databricks serverless.

First, minimize the repro. Strip the foreachBatch down to just the MERGE with a small, fixed input sample. If that still segfaults, you have a compact repro to hand off to Databricks support.
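A stripped-down foreachBatch along those lines might look like the sketch below. The table name `main.demo.target`, the view name `repro_updates`, and the key column `id` are placeholders I invented, not anything from your job. It also assumes the temp-view pattern via `batch_df.sparkSession`, which is what the Databricks foreachBatch examples use.

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Build the simplest possible upsert: one matched and one not-matched clause."""
    return (
        f"MERGE INTO {target} AS t "
        f"USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def merge_batch(batch_df, batch_id, target="main.demo.target", key="id",
                sample_rows=1000):
    """foreachBatch handler that does nothing except a small MERGE."""
    # limit() caps the size but does not pin which rows you get; for a truly
    # fixed sample, read from a small static table instead.
    sample = batch_df.limit(sample_rows)
    sample.createOrReplaceTempView("repro_updates")
    # Use the batch's own session so the temp view is visible to the query.
    sample.sparkSession.sql(build_merge_sql(target, "repro_updates", key))
```

You would wire it in with `stream_df.writeStream.foreachBatch(merge_batch)` and keep everything else (extra libraries, logging, side writes) out until the crash either reproduces or goes away.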

Second, check for patterns that commonly trigger native crashes: very wide rows or huge batch sizes in the MERGE target or source, non-standard or heavy native Python libraries in the foreachBatch that may not be fully compatible with the serverless runtime, or extremely large broadcast joins and skewed data driving single executors to extreme memory use.
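One rough way to check for wide rows and key skew is to profile a batch before the MERGE runs. This is only a sketch: `profile_batch` and `top_key_share` are hypothetical helpers, and `key` stands for whatever column your MERGE joins on.

```python
def top_key_share(key_counts, total_rows):
    """Fraction of all rows that belong to the single most frequent key."""
    if total_rows <= 0 or not key_counts:
        return 0.0
    return max(key_counts.values()) / total_rows


def profile_batch(batch_df, key, top_n=10):
    """Collect row count, column count, and a simple skew indicator."""
    total_rows = batch_df.count()
    # Only the top_n heaviest keys are pulled back to the driver.
    top = (
        batch_df.groupBy(key)
        .count()
        .orderBy("count", ascending=False)
        .limit(top_n)
        .collect()
    )
    key_counts = {row[key]: row["count"] for row in top}
    return {
        "rows": total_rows,
        "columns": len(batch_df.columns),
        "top_key_share": top_key_share(key_counts, total_rows),
    }
```

A `top_key_share` close to 1.0 means one key dominates the batch, which is the kind of skew that can push a single task to extreme memory use. The same numbers are also useful to include in a support ticket.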

Third, try configuration workarounds where possible. If you're on a Serverless SQL Warehouse with Python UDFs, test the same logic on a Pro or classic cluster to isolate whether the issue is serverless-specific. You can also reduce per-task pressure -- smaller input batches, fewer partitions per foreachBatch run, a simpler MERGE (fewer WHEN MATCHED clauses, fewer columns) -- as a diagnostic step.
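If the stream reads from a Delta table (an assumption on my part, since I haven't seen the code), the source-side rate limits are the easiest way to shrink each micro-batch. `maxFilesPerTrigger` and `maxBytesPerTrigger` are real Delta streaming source options; the helper name and the default values below are just illustrative.

```python
def throttled_source_options(max_files: int = 10, max_bytes: str = "256m") -> dict:
    """Reader options that cap how much data each micro-batch pulls in."""
    return {
        # Upper bound on new files considered per micro-batch.
        "maxFilesPerTrigger": str(max_files),
        # Soft cap on the data volume per micro-batch.
        "maxBytesPerTrigger": max_bytes,
    }
```

Usage would be something like `spark.readStream.options(**throttled_source_options()).table("main.demo.source")`, where the table name is again a placeholder. If the segfault disappears with smaller batches, that points toward a workload-shape problem rather than a bug in one specific code path.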

Fourth, gather artifacts for support: the exact workspace URL of the failed run and job/warehouse, the minimal foreachBatch/MERGE code, and approximate input sizes (rows, GB) along with table schema characteristics like very wide columns or large VARCHAR/BINARY fields.

Worth flagging -- given this is a native crash in a managed serverless runtime, the real fix usually comes down to one of two paths: adjusting workload shape to avoid a specific stress pattern, or getting a Databricks runtime/serverless bug investigated and addressed via support with a minimal repro.

Hope this helps, Louis.