<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140313#M51380</link>
    <description>&lt;P&gt;Difficult to know for sure, but it may have to do with the use of &lt;STRONG&gt;spot&lt;/STRONG&gt; instances, since the root cause seems somewhat random. Spot instances can be terminated at any time by the cloud provider if it needs the capacity back; Databricks should handle this by replacing lost spot workers or applying resilient policies, but this type of error can still slip through.&lt;/P&gt;&lt;P&gt;So I can't guarantee that this is your issue. However, you can try disabling that option for a while, keeping in mind that costs will be somewhat higher. In any case, don't use "spot" instances in PROD unless your workloads can tolerate interruptions.&lt;/P&gt;</description>
    <pubDate>Tue, 25 Nov 2025 13:31:30 GMT</pubDate>
    <dc:creator>Coffee77</dc:creator>
    <dc:date>2025-11-25T13:31:30Z</dc:date>
    <item>
      <title>Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140311#M51379</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-11-25 at 6.08.19 PM.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21948i18D7368CA5AC4C25/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot 2025-11-25 at 6.08.19 PM.png" alt="Screenshot 2025-11-25 at 6.08.19 PM.png" /&gt;&lt;/span&gt;&lt;BR /&gt;This error pops up in my Databricks workflow 1 out of 10 times, and every time it occurs I see the message below in the event logs.&lt;BR /&gt;&lt;STRONG&gt;Compute upsize complete, but below target size. The current worker count is 1, out of a target of 3.&lt;/STRONG&gt;&lt;BR /&gt;And right after this my job cluster terminates with the socket error message.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-11-25 at 6.10.50 PM.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21949iF1F6BD9624DE7FC3/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot 2025-11-25 at 6.10.50 PM.png" alt="Screenshot 2025-11-25 at 6.10.50 PM.png" /&gt;&lt;/span&gt;&lt;BR /&gt;These are my cluster configs, if required.&lt;/P&gt;&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="Screenshot 2025-11-25 at 6.12.30 PM.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/21950i3E0739CAB36E9294/image-size/medium?v=v2&amp;amp;px=400" role="button" title="Screenshot 2025-11-25 at 6.12.30 PM.png" alt="Screenshot 2025-11-25 at 6.12.30 PM.png" /&gt;&lt;/span&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 25 Nov 2025 12:43:33 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140311#M51379</guid>
      <dc:creator>ashishCh</dc:creator>
      <dc:date>2025-11-25T12:43:33Z</dc:date>
    </item>
    <item>
      <title>Re: Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140313#M51380</link>
      <description>&lt;P&gt;Difficult to know for sure, but it may have to do with the use of &lt;STRONG&gt;spot&lt;/STRONG&gt; instances, since the root cause seems somewhat random. Spot instances can be terminated at any time by the cloud provider if it needs the capacity back; Databricks should handle this by replacing lost spot workers or applying resilient policies, but this type of error can still slip through.&lt;/P&gt;&lt;P&gt;So I can't guarantee that this is your issue. However, you can try disabling that option for a while, keeping in mind that costs will be somewhat higher. In any case, don't use "spot" instances in PROD unless your workloads can tolerate interruptions.&lt;/P&gt;
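&lt;P&gt;If you want to test that and you are on AWS, pinning workers to on-demand capacity looks roughly like this in a job cluster spec (an untested sketch with illustrative node type and runtime version; on Azure the equivalent is "azure_attributes" with "availability": "ON_DEMAND_AZURE"):&lt;/P&gt;&lt;PRE&gt;{
  "new_cluster": {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": { "min_workers": 1, "max_workers": 3 },
    "aws_attributes": { "availability": "ON_DEMAND" }
  }
}&lt;/PRE&gt;&lt;P&gt;"SPOT_WITH_FALLBACK" is a middle ground: it keeps spot pricing but falls back to on-demand when spot capacity cannot be acquired.&lt;/P&gt;</description>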
      <pubDate>Tue, 25 Nov 2025 13:31:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140313#M51380</guid>
      <dc:creator>Coffee77</dc:creator>
      <dc:date>2025-11-25T13:31:30Z</dc:date>
    </item>
    <item>
      <title>Re: Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes</title>
      <link>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140327#M51386</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/145827"&gt;@ashishCh&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The [CANNOT_OPEN_SOCKET] failures stem from PySpark’s default, socket‑based data transfer path used when collecting rows back to Python (e.g., .collect(), .first(), .take()), where the local handshake to a JVM‑opened ephemeral port on 127.0.0.1 intermittently times out or is refused.&amp;nbsp;&lt;/P&gt;
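&lt;P&gt;Concretely, any action that pulls rows into the driver's Python process goes through that local socket. A minimal PySpark illustration (illustrative only; any such action can surface the error when the handshake fails):&lt;/P&gt;&lt;PRE&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(100_000)

rows = df.take(10)         # JVM opens an ephemeral local port; Python connects to stream rows back
first = df.first()         # same socket-based transfer path
everything = df.collect()  # largest transfer, most likely to hit the timeout/refusal&lt;/PRE&gt;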
&lt;P&gt;These failures can happen due to spot instance termination, executor unresponsiveness under memory/CPU pressure, and similar disruptions.&lt;/P&gt;
&lt;P&gt;To mitigate this error, you can add the following Spark configuration to your job compute clusters:&lt;/P&gt;&lt;PRE&gt;spark.databricks.pyspark.useFileBasedCollect true&lt;/PRE&gt;&lt;P&gt;This switches the data transfer mechanism from sockets to temporary files, thereby avoiding reliance on the local network layer.&lt;/P&gt;
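&lt;P&gt;For reference, in a Jobs API job cluster definition that setting lives under spark_conf (a partial sketch; merge the key into any existing spark_conf rather than replacing it):&lt;/P&gt;&lt;PRE&gt;{
  "new_cluster": {
    "spark_conf": {
      "spark.databricks.pyspark.useFileBasedCollect": "true"
    }
  }
}&lt;/PRE&gt;&lt;P&gt;In the cluster UI, the same line goes under Advanced options &gt; Spark &gt; Spark config.&lt;/P&gt;</description>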
      <pubDate>Tue, 25 Nov 2025 18:17:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/facing-cannot-open-socket-error-after-job-cluster-fails-to/m-p/140327#M51386</guid>
      <dc:creator>iyashk-DB</dc:creator>
      <dc:date>2025-11-25T18:17:11Z</dc:date>
    </item>
  </channel>
</rss>