stbjelcevic
Databricks Employee

+1 to @pradeep_singh 

The Workspace FUSE (WSFS) daemons use ports 1015, 1017, and 1021 for communication between the driver and the executors. NFS tooling (hardcoded in glibc) can race for these ports during cluster startup, causing the FUSE daemons to fail to bind. This explains the intermittent nature of the failure: when the port race doesn't happen, everything works fine.

On interactive clusters, the driver accesses /Workspace via a local FUSE mount. On multi-node job clusters, executors must RPC to the driver over those ports (a fundamentally different code path).

Check your VPC security group rules to ensure all TCP ports are open between nodes in the same security group (if you are using a managed VPC).
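
If it helps to verify this from the cluster itself, here is a minimal sketch of a TCP reachability check for those three ports. The function name `check_ports` and the `WSFS_PORTS` constant are my own illustrative names, not part of any Databricks API, and the host you pass in is an assumption you need to supply (for example, the driver's address as seen from an executor).

```python
import socket

# Ports named in this post; the constant name is illustrative only.
WSFS_PORTS = (1015, 1017, 1021)


def check_ports(host, ports=WSFS_PORTS, timeout=2.0):
    """Return {port: bool} saying whether a TCP connect to host:port succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError (refused, timed out, unreachable)
            # when nothing answers on the port or a firewall rule drops the traffic.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

Running something like `check_ports(driver_host)` from an executor should give a rough signal. A `False` only tells you the connect failed; it cannot distinguish a blocked security group rule from a daemon that never bound the port, so treat it as a starting point rather than a diagnosis.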
