Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Facing CANNOT_OPEN_SOCKET error after job cluster fails to upscale to target nodes

ashishCh
New Contributor II

Screenshot 2025-11-25 at 6.08.19 PM.png
This error pops up in my Databricks workflow about 1 out of 10 times, and every time it occurs I see the message below in the event logs.
Compute upsize complete, but below target size. The current worker count is 1, out of a target of 3.


And right after this my job cluster terminates with the socket error message.

Screenshot 2025-11-25 at 6.10.50 PM.png

Here are my cluster configs, in case they're needed.

Screenshot 2025-11-25 at 6.12.30 PM.png

2 REPLIES

Coffee77
Contributor III

It's difficult to know for sure, but it may be related to the use of spot instances, since the root cause seems somewhat random. In theory, spot instances can be terminated at any time by the cloud provider if it needs the capacity back, BUT Databricks should handle this correctly by replacing lost spot workers or applying resilience policies to avoid this type of error.

So I can't guarantee that this is your issue. However, you can try disabling that option for a while, keeping in mind that costs will be a bit higher. In any case, don't use "spot" instances in PROD unless your workloads can tolerate interruptions.
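For example, if the job cluster is defined through the Jobs API, the spot setting lives in the cloud attributes of the cluster spec. A minimal sketch assuming AWS (Azure and GCP use azure_attributes / gcp_attributes with their own availability values):

# Sketch: relevant part of a Jobs API "new_cluster" block, assuming AWS.
# Azure/GCP clusters use azure_attributes / gcp_attributes instead.
cluster_attributes = {
    "aws_attributes": {
        "availability": "ON_DEMAND",  # all nodes on-demand: no spot reclaim risk
        # or, to keep some spot savings while protecting the driver:
        # "availability": "SPOT_WITH_FALLBACK",
        # "first_on_demand": 1,
    },
}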


Lifelong Learner Cloud & Data Solution Architect | https://www.youtube.com/@CafeConData

iyashk-DB
Databricks Employee

@ashishCh 

The [CANNOT_OPEN_SOCKET] failures stem from PySpark's default, socket-based data transfer path used when collecting rows back to Python (e.g., .collect(), .first(), .take()), where the local handshake to a JVM-opened ephemeral port on 127.0.0.1 intermittently times out or is refused.
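For illustration, these are the kinds of driver-side calls that exercise that socket path by default (a minimal sketch; the table name is hypothetical, and spark is the SparkSession Databricks provides):

# Sketch: driver-side collects that use PySpark's default socket-based transfer.
# The table name is hypothetical; `spark` is the ambient SparkSession on Databricks.
df = spark.read.table("main.demo.orders")

rows = df.limit(10).collect()   # pulls rows back into the Python driver
first_row = df.first()          # same transfer path under the hood
sample_rows = df.take(5)        # same transfer path under the hood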

Such failures can happen due to spot instance termination, executor unresponsiveness under memory/CPU pressure, and similar conditions.

To mitigate this error, add the following Spark configuration to your job compute clusters:
spark.databricks.pyspark.useFileBasedCollect true

This switches the data transfer mechanism from sockets to temporary files, thereby avoiding reliance on the local network layer.
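As a sketch of where that setting goes in a Jobs API cluster definition (runtime version, node type, and worker counts below are placeholders, not recommendations):

# Sketch: Jobs API "new_cluster" block with the suggested Spark conf.
# Runtime version, node type, and worker counts are placeholders.
new_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "autoscale": {"min_workers": 1, "max_workers": 3},
    "spark_conf": {
        # switch PySpark collects from local sockets to temporary files
        "spark.databricks.pyspark.useFileBasedCollect": "true",
    },
}

The same key/value pair can also be added in the cluster UI under Advanced options > Spark config.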
